Re: regexp strangeness

octave-maintainers

From:	Andrew Janke
Subject:	Re: regexp strangeness
Date:	Tue, 11 Feb 2020 00:08:02 -0500

On 2/8/20 10:57 AM, Kay Nick wrote:

> On 08.02.20 15:01, Andreas Weber wrote:
>> On 08.02.20 at 12:47, Kay Nick wrote:
>>> the documentation for regexp says:
>>>
>>> '\w'
>>>           Match any word character
>>>
>>> What exactly is a word character (and, maybe even more important, what isn't)?
>> It's always worth having a look at the underlying library, PCRE in this
>> case: https://www.pcre.org/original/doc/html/pcrepattern.html
>>
>> ...A "word" character is an underscore or any character that is a letter
>> or digit. By default, the definition of letters and digits is controlled
>> by PCRE's low-valued character tables, and may vary if locale-specific
>> matching is taking place (see "Locale support" in the pcreapi page). For
>> example, in a French locale such as "fr_FR" in Unix-like systems, or
>> "french" in Windows, some character codes greater than 127 are used for
>> accented letters, and these are then matched by \w. The use of locales
>> with Unicode is discouraged. ....

Matlab compatibility note: in Matlab's regexp() functions, the \w
metacharacter appears to match any alphanumeric character in any script
within Unicode, not just the ASCII-compatible '[a-zA-Z0-9_]'. Over
there, it seems like \w is equivalent to '[\p{L}\p{N}_]'.
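
To make the difference concrete, here is a rough sketch in Python rather than
Octave or Matlab. Python's `re` module is not PCRE, and its Unicode-aware \w
is only assumed to be close to '[\p{L}\p{N}_]', not identical to it, so treat
this as an illustration of ASCII-only versus Unicode-aware matching and not
as a statement about what either Octave or Matlab does internally. The sample
string is made up for the example.

```python
import re

# Hypothetical sample text: "naïve café Ωmega_42", written with escapes
# so the source file stays pure ASCII.
text = "na\u00efve caf\u00e9 \u03a9mega_42"

# ASCII-only word characters, i.e. \w behaves like [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware word characters (Python's default for str patterns):
# letters and digits from any script, plus underscore.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # accented and Greek letters split or drop out of matches
print(unicode_words)  # each whitespace-separated token matches as one word
```

With the ASCII-only flag the accented and Greek letters act as word
boundaries, so "naïve" comes back as two fragments; with Unicode-aware
matching each token comes back whole.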

The Matlab documentation is not very explicit about this, and its
wording is a little muddled.

Sounds like maybe Octave should be running PCRE in Unicode mode, and
compiling its patterns with the PCRE_UCP option set?

Cheers,
Andrew
