bug-grep

bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte


From: Jim Meyering
Subject: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
Date: Mon, 23 Sep 2013 14:04:09 -0700

[using the right bug address, this time]

On Mon, Sep 23, 2013 at 11:26 AM, Aharon Robbins <address@hidden> wrote:
> Hi.
>
>>     $ printf '\x82\n' > in; ./grep -q '\S' in && echo match
>>     match
>>
>> Now, require a back-reference (forcing switch from grep's DFA matcher
>> to use of the regex functions), and you see there is no match:
>>
>>     $ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
>>     $
>
> I see similar results with gawk, accounting for syntactic difference
> and a different way to force the regex matcher.
>
> So far so good.
>
>> Uh oh.  This is worse: \s is not multi-byte aware.
>> The two-byte "NO-BREAK SPACE" character is not matched by \s.
>>
>> This fails:
>>     $ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
>>     $
>>
>> This matches in spite of the fact that grep.texi says \s is
>>      equivalent to [[:space:]] :
>>     $ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
>>     a b
>>
>> GNU grep fails:
>> (but if I do s/\\s/[[:space:]]/ to the RE, then it does match)
>>     $ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1' grep:
>>     $
>
> I cannot reproduce this with gawk.  Setting GAWK_NO_DFA=1 in the
> environment causes gawk to bypass dfa. For these it makes no
> difference:
>
> $ printf 'a\xc2\xa0b\n' | ./gawk '/a\sb/'
> $ printf 'a\xc2\xa0b\n' | GAWK_NO_DFA=1 ./gawk '/a\sb/'
>
> No result from either, and similar results for [[:space:]].

Hi Arnold,
[re-adding CC to the bug tracker]

Thanks for testing.
When I test on glibc, I can confirm what you report: [[:space:]] fails to
match NBSP.  That makes me think either glibc's UTF-8 attribute tables are
wrong, or there's a bug in regex:

  $ printf 'a\xc2\xa0b\n'|LC_ALL=en_US.UTF-8 grep 'a[[:space:]]b'
  [Exit 1]
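
One way to separate the two suspects (locale tables vs. regex) would be to
ask the C library directly whether it puts U+00A0 in the space class.  The
following is only a minimal sketch: the helper name is hypothetical, and
it assumes wchar_t values are Unicode code points (true on glibc, not
guaranteed everywhere).

```c
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of grep or gawk.
   Return 1 if WC is in the space class under LOCALE_NAME,
   0 if it is not, and -1 if that locale cannot be selected.
   Assumption: WC is a Unicode code point stored in a wint_t.  */
int
space_class_in_locale (const char *locale_name, wint_t wc)
{
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;
  return iswspace (wc) != 0;
}
```

Calling space_class_in_locale ("en_US.UTF-8", 0x00A0) on both systems
would show whether the libc tables themselves disagree.  If they agree,
the regex code becomes the more likely culprit.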

Initially, I considered constructing a DFA that would match all UTF-8
white space characters (see the FIXME comment), and another that would
match the complement of that set minus the set of invalid UTF-8 bytes,
but ended up preferring the simpler change.
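
To give a feel for what the rejected approach would have to recognize,
here is a rough hand-written byte matcher.  It is a sketch only, not the
patch and not dfa.c code.  It assumes "all UTF-8 white space characters"
means the Unicode White_Space set, and the function name is made up for
illustration.

```c
#include <stddef.h>

/* Hypothetical illustration, not taken from dfa.c.
   Return the byte length of the white space character encoded in UTF-8
   at the start of S (N bytes available), or 0 if there is none.
   Assumption: "white space" is the Unicode White_Space set.  */
size_t
utf8_space_len (const unsigned char *s, size_t n)
{
  /* ASCII: space, and U+0009..U+000D (tab, LF, VT, FF, CR).  */
  if (n >= 1 && (s[0] == ' ' || (s[0] >= '\t' && s[0] <= '\r')))
    return 1;

  /* U+0085 NEXT LINE and U+00A0 NO-BREAK SPACE.  */
  if (n >= 2 && s[0] == 0xC2 && (s[1] == 0x85 || s[1] == 0xA0))
    return 2;

  if (n >= 3)
    {
      /* U+1680 OGHAM SPACE MARK.  */
      if (s[0] == 0xE1 && s[1] == 0x9A && s[2] == 0x80)
        return 3;
      /* U+2000..U+200A, U+2028, U+2029, U+202F.  */
      if (s[0] == 0xE2 && s[1] == 0x80
          && ((s[2] >= 0x80 && s[2] <= 0x8A)
              || s[2] == 0xA8 || s[2] == 0xA9 || s[2] == 0xAF))
        return 3;
      /* U+205F MEDIUM MATHEMATICAL SPACE.  */
      if (s[0] == 0xE2 && s[1] == 0x81 && s[2] == 0x9F)
        return 3;
      /* U+3000 IDEOGRAPHIC SPACE.  */
      if (s[0] == 0xE3 && s[1] == 0x80 && s[2] == 0x80)
        return 3;
    }
  return 0;
}
```

Even this small table shows why the simpler change was attractive: the
complement set, minus invalid byte sequences, would be considerably
more involved to spell out byte by byte.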

FTR, I tested this only on a system for which all tests passed (OS X).
I'm very surprised to find it doesn't work on a glibc-based system.




