bug#24975: Matching issues with characters whose encoding ends in some o

bug-grep

From:	Jim Meyering
Subject:	bug#24975: Matching issues with characters whose encoding ends in some other character
Date:	Sun, 20 Nov 2016 21:53:29 -0800

On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas
<address@hidden> wrote:
> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>> $ locale charmap
>> GB18030
>> $ printf '\uC9\n' | grep  '.*7'  | hd
>> 00000000  81 30 87 37 0a                                    |.0.7.|
>> 00000005
>>
>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
> [...]
>> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
> [...]
>
> Same behaviour with 2.26 on Solaris 11.

Thank you for the report.
I can reproduce that error on Fedora 25 with this:

  $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
  5
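
The five bytes counted above are the four-byte GB18030 encoding of
U+00C9 plus the newline. As a rough illustration of why a byte-oriented
matcher sees a '7' where a character-oriented one does not, here is a
minimal Python sketch. It uses Python's built-in gb18030 codec and re
module rather than grep, so it only models the symptom and says nothing
about grep's internals:

```python
import re

# U+00C9 has no one- or two-byte form in GB18030, so it is encoded
# as a four-byte sequence whose last byte is 0x37 (ASCII '7').
text = "\u00c9\n"
data = text.encode("gb18030")
print(data.hex(" "))  # 81 30 87 37 0a

# Byte-oriented matching: '.' consumes single bytes, so the trailing
# 0x37 byte of the character is mistaken for an ASCII '7'.
byte_match = re.search(rb".*7", data)

# Character-oriented matching: the line holds one character, U+00C9,
# and no '7', so there is nothing to match.
char_match = re.search(r".*7", text)

print(byte_match is not None)  # True
print(char_match is not None)  # False
```

The byte-level result corresponds to the false match reported here; the
character-level result is what a correct matcher should report.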

I confirmed that the problem does not arise (i.e., no match, with exit
status of 1) when we force the use of glibc's regex matcher by
inserting a trivial back-reference:

  $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
  1

This bisected to v2.18-54-g3ef4c8e, but that commit was just the
messenger: it exposed a latent bug by causing this case to be handled
by grep's dfa.c rather than by glibc's regex matcher.
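
To make the underlying constraint concrete: in GB18030 a byte such as
0x37 is a character of its own only when it sits on a character
boundary. The Python sketch below is an illustration only, not dfa.c's
algorithm. The function names are hypothetical, and it assumes
well-formed input and does no validation. It finds the character
boundaries from the lead-byte rules (lead byte 0x81-0xFE followed by
an ASCII digit starts a four-byte form; followed by 0x40 or above, a
two-byte form) and checks where the 0x37 byte falls:

```python
def gb18030_char_lengths(data):
    """Return the byte length of each character in GB18030 data.

    Assumes well-formed input; malformed sequences are not detected.
    """
    lengths = []
    i = 0
    while i < len(data):
        lead = data[i]
        nxt = data[i + 1] if i + 1 < len(data) else None
        if 0x81 <= lead <= 0xFE and nxt is not None and 0x30 <= nxt <= 0x39:
            n = 4  # four-byte form: second byte is an ASCII digit
        elif 0x81 <= lead <= 0xFE and nxt is not None and nxt >= 0x40:
            n = 2  # two-byte form
        else:
            n = 1  # single byte, e.g. ASCII
        lengths.append(n)
        i += n
    return lengths


def char_starts(data):
    """Return the byte offsets at which characters begin."""
    starts = []
    pos = 0
    for n in gb18030_char_lengths(data):
        starts.append(pos)
        pos += n
    return starts


data = b"\x81\x30\x87\x37\x0a"  # U+00C9 followed by a newline
print(gb18030_char_lengths(data))  # [4, 1]

# The 0x37 byte is at offset 3, inside the four-byte character,
# so it is not a '7' character in this locale.
print(data.index(b"7") in char_starts(data))  # False
```

Any matcher that advances one byte at a time has to keep track of this
state in some form, or it will accept trailing bytes such as 0x37 as
if they were ASCII characters.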
