octave-maintainers

Re: regexp strangeness


From: Andrew Janke
Subject: Re: regexp strangeness
Date: Tue, 11 Feb 2020 00:08:02 -0500
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Thunderbird/68.4.2


On 2/8/20 10:57 AM, Kay Nick wrote:

> On 08.02.20 15:01, Andreas Weber wrote:
>> On 08.02.20 at 12:47, Kay Nick wrote:
>>> The documentation for regexp says:
>>>
>>> '\w'
>>>           Match any word character
>>>
>>> What exactly is a word character (and, maybe even more important, what isn't)?
>> It's always worth having a look at the underlying library, PCRE in this
>> case: https://www.pcre.org/original/doc/html/pcrepattern.html
>>
>> ...A "word" character is an underscore or any character that is a letter
>> or digit. By default, the definition of letters and digits is controlled
>> by PCRE's low-valued character tables, and may vary if locale-specific
>> matching is taking place (see "Locale support" in the pcreapi page). For
>> example, in a French locale such as "fr_FR" in Unix-like systems, or
>> "french" in Windows, some character codes greater than 127 are used for
>> accented letters, and these are then matched by \w. The use of locales
>> with Unicode is discouraged. ....

Matlab compatibility note: in Matlab's regexp() functions, the \w
metacharacter appears to match any alphanumeric character in any script
within Unicode, not just the ASCII-compatible '[a-zA-Z0-9_]'. Over
there, it seems like \w is equivalent to '[\p{L}\p{N}_]'.
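
To make the difference between the two character classes concrete, here
is a rough sketch in Python (not Matlab or Octave code). It classifies
single characters by Unicode general category, which is my reading of
what '[\p{L}\p{N}_]' amounts to. The helper names are made up for
illustration.

```python
import unicodedata


def is_unicode_word_char(ch):
    # Hypothetical helper: membership in [\p{L}\p{N}_].
    # General categories starting with "L" are letters, "N" are numbers.
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def is_ascii_word_char(ch):
    # Hypothetical helper: membership in [a-zA-Z0-9_].
    return ch.isascii() and (ch.isalnum() or ch == "_")


# Print ASCII vs. Unicode classification for a few sample characters.
for ch in ["a", "7", "_", "\u00e9", "\u0416", "-", " "]:
    print(repr(ch), is_ascii_word_char(ch), is_unicode_word_char(ch))
```

The accented and Cyrillic letters are the cases where the two
definitions disagree.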

The Matlab documentation is not very explicit about this, and its
wording is a little muddled.

Sounds like maybe Octave should be running PCRE in Unicode mode, and
compiling its patterns with the PCRE_UCP option set?
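
As an analogy only: Python's re module has a similar switch. With
re.ASCII, \w is restricted to [a-zA-Z0-9_]; without it, str patterns use
Unicode-aware matching. This is Python, not PCRE, so it only illustrates
the kind of behavioral difference I would expect PCRE_UCP to make, not
what Octave actually does.

```python
import re

# "naive cafe_1" with a precomposed i-diaeresis and e-acute.
text = "na\u00efve caf\u00e9_1"

# ASCII-only \w: accented letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware \w: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```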

Cheers,
Andrew


