bug-grep

bug#24975: Matching issues with characters whose encoding ends in some o


From: Jim Meyering
Subject: bug#24975: Matching issues with characters whose encoding ends in some other character
Date: Sun, 20 Nov 2016 21:53:29 -0800

On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas
<address@hidden> wrote:
> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>> $ locale charmap
>> GB18030
>> $ printf '\uC9\n' | grep  '.*7'  | hd
>> 00000000  81 30 87 37 0a                                    |.0.7.|
>> 00000005
>>
>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
> [...]
>> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
> [...]
>
> Same behaviour with 2.26 on Solaris 11.

Thank you for the report.
I can reproduce that error on Fedora 25 with this:

  $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
  5

I confirmed that the problem does not arise (i.e., no match, with exit
status of 1) when we force the use of glibc's regex matcher by
inserting a trivial back-reference:

  $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
  1

This bisected to v2.18-54-g3ef4c8e, but that commit was just the
messenger: it exposed the latent bug by making it so this case was no
longer handled by glibc's regexp matcher, but rather by grep's dfa.c.
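
To illustrate why the trailing 0x37 byte matters, here is a minimal Python
sketch. It is not grep's or dfa.c's actual logic; it only assumes Python's
built-in gb18030 codec and contrasts a character-aware match with a purely
byte-oriented one on the same input:

```python
import re

# U+00C9 followed by a newline, encoded in GB18030.
# The hex dump in the report shows the bytes 81 30 87 37 0a.
encoded = "\u00c9\n".encode("gb18030")
print(" ".join(f"{b:02x}" for b in encoded))

# Character-aware matching: decode first, then search for the character '7'.
# The decoded line holds only U+00C9, so there is nothing to match.
char_match = re.search(r".*7", encoded.decode("gb18030"))

# Byte-oriented matching: the final byte of the multibyte character is 0x37,
# which is indistinguishable from an ASCII '7' at this level.
byte_match = re.search(rb".*7", encoded)

print("character-aware match:", char_match is not None)
print("byte-oriented match:  ", byte_match is not None)
```

The expected result is "no match", as with glibc's matcher above; a matcher
that treats the last byte of the multibyte sequence as a standalone '7'
reports a match instead.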




