bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
From: Jim Meyering
Subject: bug#15440: [PATCH] dfa: fix \s and \S to work for multibyte
Date: Sun, 22 Sep 2013 22:17:01 -0700
This one really surprised me.
Learning that multibyte \s and \S had been broken since grep-2.6 did
not make my day. But fixing it helped.
Here's how it started:
To demonstrate the (first) bug, set up to use a UTF-8 locale:
export LC_ALL=en_US.UTF-8
then run this and note that it matches:
$ printf '\x82\n' > in; ./grep -q '\S' in && echo match
match
Now, require a back-reference (forcing switch from grep's DFA matcher
to use of the regex functions), and you see there is no match:
$ printf '\x82\x82\n' > in; ./grep -qE '(\S)\1' in && echo match
$
One fix would be to make it so dfaexec's \S-processing fails to match an
invalid multibyte sequence, just as its "."-processing does.
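To see why the lone \x82 byte counts as an invalid multibyte sequence, here is a minimal Python sketch. It is an illustration only, not grep's or dfa.c's code, and the helper name is_valid_utf8 is hypothetical. In UTF-8, 0x82 is a continuation byte, so it is only valid after a suitable lead byte.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone = b"\x82"        # the byte used in the first demonstration
paired = b"\xc2\x82"  # the same byte preceded by a valid lead byte (U+0082)

# A continuation byte has the bit pattern 10xxxxxx.
is_continuation = (lone[0] & 0xC0) == 0x80

print(is_continuation)        # True
print(is_valid_utf8(lone))    # False
print(is_valid_utf8(paired))  # True
```

A matcher that treats input one byte at a time sees 0x82 as "some non-space byte", while a multibyte-aware matcher sees no valid character at all, which would explain the differing results above.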
That led me to this realization:
Uh oh. This is worse: \s is not multi-byte aware.
The two-byte "NO-BREAK SPACE" character is not matched by \s.
This fails:
$ printf 'a\xc2\xa0b\n'|./grep 'a\sb'
$
Yet this matches, even though grep.texi says \s is
equivalent to [[:space:]]:
$ printf 'a\xc2\xa0b\n'|./grep 'a[[:space:]]b'
a b
With a back-reference, GNU grep also fails to match
(but if I do s/\\s/[[:space:]]/ to the RE, then it does match):
$ printf 'a\xc2\xa0ba\xc2\xa0b\n'|./grep -E '(a\sb)\1'
$
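The difference between byte-wise and character-wise handling of \s can be sketched with Python's re module, used here purely as an analogy and not as a model of grep's implementation. In Python, a bytes pattern restricts \s to ASCII whitespace, while a str pattern applies \s to decoded characters.

```python
import re

# 'a', U+00A0 NO-BREAK SPACE encoded as the two bytes C2 A0, then 'b'
line = b"a\xc2\xa0b"

# Byte-oriented: \s in a bytes pattern covers only ASCII whitespace,
# so neither 0xC2 nor 0xA0 can satisfy it.
bytewise = re.search(rb"a\sb", line)

# Character-oriented: decode first, so \s is tested against one character.
text = line.decode("utf-8")
charwise = re.search(r"a\sb", text)

# The back-reference variant, on the doubled input from the last example.
doubled = re.search(r"(a\sb)\1", text * 2)

print(bytewise is None)      # True
print(charwise is not None)  # True
print(doubled is not None)   # True
```

The byte-oriented search fails for the same structural reason described above: the no-break space occupies two bytes, and a matcher that is not multibyte-aware never sees it as a single character.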
Patch attached:
0003-dfa-fix-s-and-S-to-work-for-multibyte.patch
Description: Binary data