[Octave-bug-tracker] [bug #35910] Incorrect regex matching of multi-byte UTF-8 characters
From: Burkart Lingner
Subject: [Octave-bug-tracker] [bug #35910] Incorrect regex matching of multi-byte UTF-8 characters
Date: Tue, 20 Mar 2012 16:29:26 +0000
User-agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:11.0) Gecko/20100101 Firefox/11.0
URL:
<http://savannah.gnu.org/bugs/?35910>
Summary: Incorrect regex matching of multi-byte UTF-8
characters
Project: GNU Octave
Submitted by: burkart
Submitted on: Tue 20 Mar 2012 04:29:25 PM GMT
Category: Interpreter
Severity: 3 - Normal
Priority: 5 - Normal
Item Group: Incorrect Result
Status: None
Assigned to: None
Originator Name:
Originator Email:
Open/Closed: Open
Discussion Lock: Any
Release: 3.6.1
Operating System: GNU/Linux
_______________________________________________________
Details:
When a pattern that matches a single character (such as ".") is applied at a
position holding a multi-byte UTF-8 character, only the first byte of that
character is matched. Depending on how the match is then processed, the result
can be invalid UTF-8. Example:
string = regexprep('§x', '^(.)', '$1;')
fprintf('%x\n', string)
yields
string = ?;?x
c2
3b
a7
78
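where "?" is the replacement character displayed for each invalid byte, and the
UTF-8 encodings of "§", ";", and "x" are "0xC2 0xA7", "0x3B", and "0x78",
respectively. The ";" has therefore been inserted between the two bytes of "§".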
The expected output would have been
string = §;x
c2
a7
3b
78
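
For illustration only, the same effect can be reproduced outside Octave. The sketch below is Python, not Octave, and it says nothing about how Octave's regex code is actually implemented; it only assumes that a byte-oriented regex engine treats "." as "one byte", while a character-oriented one treats it as "one code point". The variable names are made up for the example.

```python
import re

text = "§x"
raw = text.encode("utf-8")  # the three bytes c2 a7 78

# Byte-oriented matching: "." consumes only the first byte (0xC2) of "§",
# so the ";" lands between the two bytes of that character.
bytewise = re.sub(rb"^(.)", rb"\1;", raw)

# Character-oriented matching: "." consumes the whole code point "§".
charwise = re.sub(r"^(.)", r"\1;", text).encode("utf-8")

print(" ".join(f"{b:02x}" for b in bytewise))  # c2 3b a7 78
print(" ".join(f"{b:02x}" for b in charwise))  # c2 a7 3b 78

# The byte-wise result is no longer valid UTF-8.
try:
    bytewise.decode("utf-8")
    bytewise_is_valid = True
except UnicodeDecodeError:
    bytewise_is_valid = False
```

The first printed line corresponds to the byte sequence reported above, the second to the expected one.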
_______________________________________________________
Reply to this item at:
<http://savannah.gnu.org/bugs/?35910>
_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/