Re: regexp strangeness

octave-maintainers

From:	Andrew Janke
Subject:	Re: regexp strangeness
Date:	Sat, 8 Feb 2020 13:28:56 -0500
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Thunderbird/68.4.2


On 2/8/20 1:07 PM, Andrew Janke wrote:
> 
> 
> On 2/8/20 4:12 AM, Daniel J Sebald wrote:
>> On 2/8/20 3:32 AM, Kay Nick wrote:
>>> Hey all,
>>>
>>> the documentation to regexp says:
>>>
>>> '\w'
>>>            Match any word character
>>>
>>> what exactly is a word character (and, maybe even more important, what
>>> isn't)? Am I right in assuming it's
>>> [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]? What about
>>> non-English characters like öäßłńŚ?
>>
>> https://en.wikipedia.org/wiki/Regular_expression#Character_classes
>>
>> lists \w as the equivalent to [A-Za-z0-9_]
>>
>> \w probably won't match non-English letters, but maybe you could try
>> [ä-Ś] or whatever range makes sense for the alphabet of interest.
> 
> 
> I believe you can use Unicode character classes to handle this. For
> example, '\p{L}' will match any Unicode letter in any script, including
> non-English. Works for me in Octave 5.1.0.
> 
> https://www.regular-expressions.info/unicode.html
> 
> octave:3> regexp('f1o2oüö', '\p{L}')
> ans =
>    1   3   5   6   8

I guess I should have gone all the way on this: the Unicode equivalent
of \w would be '[\p{L}\p{N}_]', or '[\p{L}\p{M}\p{N}_]' if you also want
to match combining marks such as accents.
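
A minimal sketch of how those classes might be used, assuming a UTF-8
build of Octave and a UTF-8 encoded script file. The sample string and
the variable names are made up for illustration. Note that the 'start'
offsets appear to count bytes rather than characters, which would
explain the 1 3 5 6 8 in the transcript above (ü and ö each take two
bytes in UTF-8).

```octave
% Sketch only: Unicode-aware "word" matching with regexp.
% The sample string and variable names are hypothetical.
str = 'Łódź is_nice 42';

% Letters, digits and underscore, without combining marks.
word_class = '[\p{L}\p{N}_]';
% Variant that also accepts combining marks (e.g. a detached accent).
word_class_marks = '[\p{L}\p{M}\p{N}_]';

% 'match' returns the matched substrings as a cell array.
tokens = regexp(str, [word_class '+'], 'match');
tokens_marks = regexp(str, [word_class_marks '+'], 'match');

% 'start' returns the offset at which each match begins.
starts = regexp(str, [word_class '+'], 'start');

disp(tokens);
disp(starts);
```

For plain precomposed text like the sample above, both classes should
give the same tokens; they would differ only when an accent is stored
as a separate combining character.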

Cheers,
Andrew
