From: Markus Mützel
Subject: [Octave-bug-tracker] [bug #35910] Incorrect regex matching of multi-byte UTF-8 characters
Date: Sun, 3 Mar 2019 12:03:08 -0500 (EST)
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0

Follow-up Comment #4, bug #35910 (project octave):

This is still the case with Octave 5.1.
As a workaround, the pattern can be prefixed with "(*UTF8)", which is supported
by PCRE 7.9 and newer (April 2009):

>> string = regexprep('§x', '(*UTF8)^(.)', '$1;')
string = §;x
>> fprintf('%x\n', string)
c2
a7
3b
78
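
The hex dump shows why the prefix matters: "§" is U+00A7, which UTF-8 encodes as the two bytes c2 a7, and the ";" (3b) lands after both of them rather than between them. A minimal sketch for comparing the two patterns side by side, building the input from explicit byte values so that the result does not depend on the terminal or file encoding (the variable names are illustrative only):

```octave
% "§x" as raw UTF-8 bytes: c2 a7 (§) followed by 78 (x).
str = char ([194 167 120]);

% Same replacement with and without the PCRE "(*UTF8)" prefix.
with_utf8    = regexprep (str, '(*UTF8)^(.)', '$1;');
without_utf8 = regexprep (str, '^(.)', '$1;');

% Print the resulting bytes in hex, one line per variant.
printf ('with (*UTF8): %s\n', sprintf ('%02x ', double (with_utf8)));
printf ('without:      %s\n', sprintf ('%02x ', double (without_utf8)));
```

On a build that shows the bug, the second line would be expected to place 3b between c2 and a7, since "." then matches only the first byte of the character.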


I didn't see a performance impact in a crude timing test (1e5 calls each):

>> tic, for i=1:1e5, string = regexprep('§x', '(*UTF8)^(.)', '$1;'); end, toc
Elapsed time is 1.20157 seconds.
>> tic, for i=1:1e5, string = regexprep('§x', '^(.)', '$1;'); end, toc
Elapsed time is 1.23969 seconds.



Should we add this automatically? Or at least document it in the docstring for
"regexp"?
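
Until that is decided, a user-level wrapper is one way to apply the prefix consistently. A minimal sketch, assuming a hypothetical helper name regexprep_utf8 (not an existing Octave function) and PCRE 7.9 or newer:

```octave
% Hypothetical helper, not part of Octave: prepend "(*UTF8)" unless the
% pattern already starts with it, then defer to regexprep.
function out = regexprep_utf8 (str, pat, rep, varargin)
  prefix = '(*UTF8)';
  if (! strncmp (pat, prefix, numel (prefix)))
    pat = [prefix, pat];
  end
  % Any extra options are passed through to regexprep unchanged.
  out = regexprep (str, pat, rep, varargin{:});
end
```

With this in place, regexprep_utf8 ('§x', '^(.)', '$1;') should behave like the prefixed call shown earlier. Doing the same inside regexp itself would need more care, for example for callers that intentionally match on raw bytes.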

I can't find a relevant thread in the mailing list archives.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?35910>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



