[Octave-bug-tracker] [bug #35910] Incorrect regex matching of multi-byte UTF-8 characters
From: Markus Mützel
Subject: [Octave-bug-tracker] [bug #35910] Incorrect regex matching of multi-byte UTF-8 characters
Date: Sun, 3 Mar 2019 12:03:08 -0500 (EST)
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0
Follow-up Comment #4, bug #35910 (project octave):
This is still the case with Octave 5.1.
As a workaround, the pattern can be prefixed with "(*UTF8)", which is supported
by PCRE 7.9 (Apr 2009) and newer:
>> string = regexprep('§x', '(*UTF8)^(.)', '$1;')
string = §;x
>> fprintf('%x\n', string)
c2
a7
3b
78
I didn't see a measurable performance impact in a simple test:
>> tic, for i=1:1e5, string = regexprep('§x', '(*UTF8)^(.)', '$1;'); end, toc
Elapsed time is 1.20157 seconds.
>> tic, for i=1:1e5, string = regexprep('§x', '^(.)', '$1;'); end, toc
Elapsed time is 1.23969 seconds.
Should we add this automatically? Or at least document it in the docstring for
"regexp"?
I can't find a relevant thread in the mailing list archives.
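
To make the "add this automatically" idea concrete, here is a minimal sketch of
a user-level wrapper. The name regexprep_utf8 is hypothetical and not part of
Octave; it only prepends "(*UTF8)" when the pattern does not already start with
it, and it assumes an Octave build linked against PCRE 7.9 or newer, as in the
session above. It is not a proposal for how the fix should look inside Octave.

```octave
1;  % leading statement so this file is treated as a script, not a function file

% Hypothetical helper (not part of Octave): prepend "(*UTF8)" unless the
% pattern already starts with it, then delegate to the built-in regexprep.
function out = regexprep_utf8 (str, pat, rep, varargin)
  prefix = '(*UTF8)';
  if (~strncmp (pat, prefix, numel (prefix)))
    pat = [prefix, pat];
  end
  out = regexprep (str, pat, rep, varargin{:});
end

% "§" is the two UTF-8 bytes C2 A7. Build the input from bytes so the
% example does not depend on the encoding of this file.
s = char ([194 167 120]);                 % "§x"
out = regexprep_utf8 (s, '^(.)', '$1;');
fprintf ('%x\n', double (out));           % byte values of the result
```

Under the same setup as above, this should print the same four bytes as the
manual work-around (c2, a7, 3b, 78). Patterns that already carry the prefix are
passed through unchanged.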
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/bugs/?35910>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/