bug-grep

From: Jim Meyering
Subject: Re: grep -i in UTF-8: newline not printed after matching line if it contains I WITH DOT (U+0130)
Date: Tue, 14 Dec 2010 13:57:41 +0100

Ilya Basin wrote:
> $ grep -i . greptest.txt
> aIabIbcIcdId$
>
> This doesn't happen without -i or with LANG=C
>
>
> $ grep --version
> grep (GNU grep) 2.7
> $ echo $LANG
> en_US.UTF-8
>
> pcre 8.10
>
> Linux IL 2.6.36-ARCH #1 SMP PREEMPT Wed Nov 24 06:44:11 UTC 2010 i686
> Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz GenuineIntel GNU/Linux

Thanks for the report.  That is indeed a bug.
It affects even the very latest in git.

Here's another variant of it
(note how it fails to print the matched "."):

    $ i='\xC4\xB0'; printf "$i$i$i.$i$i$i$i\n" \
      | LC_ALL=en_US.UTF-8 ./grep -oi '.\.'|od -a -tx1
    0000000   D   0  nl
             c4  b0  0a
    0000003
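
For comparison, here is a small sketch of the bytes a correct run would be expected to print. It assumes that '.\.' should match the I WITH DOT immediately before the dot plus the dot itself, and that grep -o then appends a newline. The variable names are only for illustration, and octal escapes stand in for \xC4\xB0 because \x is not portable in every printf.

```shell
# Expected bytes for a correct -o match of '.\.':
# the two-byte U+0130 (c4 b0), the literal dot (2e), then a newline (0a).
expected=$(printf '\304\260.\n' | od -An -tx1 | tr -d ' \n')

# Bytes actually seen in the od dump above: the dot (2e) is missing.
observed=c4b00a

echo "expected: $expected"
echo "observed: $observed"
```

The two strings differ only by the missing 2e, which is the "." that -o should have printed.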

-----------------------------
More like your example, this shows that with -i, grep searches
a different (down-cased) copy of the line, in which each two-byte
I WITH DOT becomes a lower-case "i" that is just one byte wide.
The end-of-line offset is calculated using that all-lower-case
copy, yet the offset is not valid in the original, longer string,
so grep fails to print the entire line:

    $ i='\xC4\xB0'; printf "$i$i$i$i$i$i$i\n" \
      | LC_ALL=en_US.UTF-8 ./grep -i ....
    İİİİ
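
The offset mismatch can be simulated without grep at all. The sketch below is not grep's code: it assumes the down-cased copy is seven one-byte "i" characters plus a newline, takes that copy's length as the end-of-line offset, and applies it to the original bytes. The helper names orig_line and lower_line are hypothetical.

```shell
# Original line: seven U+0130, two bytes each (c4 b0), plus a newline.
orig_line() { printf '\304\260\304\260\304\260\304\260\304\260\304\260\304\260\n'; }
# Assumed down-cased copy that is searched: each U+0130 becomes a one-byte "i".
lower_line() { printf 'iiiiiii\n'; }

# Byte lengths of both buffers (tr strips padding some wc versions add).
orig_len=$(orig_line | wc -c | tr -d ' ')
lower_len=$(lower_line | wc -c | tr -d ' ')

# Simulated bug: use the end-of-line offset from the down-cased copy
# to decide how many bytes of the original line to print.
printed_hex=$(orig_line | dd bs=1 count="$lower_len" 2>/dev/null \
  | od -An -tx1 | tr -d ' \n')

echo "original line:   $orig_len bytes"
echo "down-cased line: $lower_len bytes"
echo "bytes printed:   $printed_hex"
```

Under these assumptions the original line is 15 bytes and the down-cased copy is 8, so only c4 b0 c4 b0 c4 b0 c4 b0 gets printed: four complete characters and no trailing newline, which is consistent with the output shown above.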

One of us should find time to fix it before too long.


