bug-grep

Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8


From: Jim Meyering
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Tue, 12 Jun 2012 17:58:49 +0200

Johannes Meixner wrote:
> Hello,
>
> On Jun 1 12:02 Jim Meyering wrote (excerpt):
>>
>>    i='\xC4\xB0'
>>    printf "$i$i$i$i$i$i$i\n" > in
>>    LC_ALL=en_US.UTF-8 grep -i .... in > out
>>    cmp in out > /dev/null || echo FAIL
>>
>> As I mentioned in the link above, this is a problem because of the way
>> grep's -i is implemented: it converts both the RE and the buffer-to-search
>> to lower case, and then performs the search.  The problem arises with
>> Turkish-I because the conversion changes the length of the buffer (in
>> the example test, the input is 15 bytes long -- 7 x 2-byte I-with-dot
>> + newline, yet the lower case version has a length of just 8: 7 x
>> lower-cased i + NL), and the code returns the match offset and length
>> relative to the shortened lower-case buffer (that lower-cased buffer is
>> internal to code duplicated in EGexecute/Fexecute), yet it uses those
>> offset,length numbers to manipulate the original buffer.
>>
>> Without re-architecting too much, one solution is to change mbtolower to
>> return additional information: a malloc'd mapping vector M, of the same
>> length as its returned buffer, where M[i] is the length-in-bytes of the
>> character that formed byte i of the result.  With that, or something
>> similar, the caller could then map the currently-erroneous offset,len
>> numbers to equivalent numbers that apply to the original buffer.  This
>> mapping could be allocated/defined only when lengths actually differ,
>> so that the cost in general would be negligible.
>
> I am not a localization expert and may be misunderstanding
> something, but it may not be safe to test only whether the lengths differ.
>
> I fear there may be a locale in which some multibyte string has
> a lower-cased counterpart of the same length in bytes, yet the
> character positions in the two strings do not match.
>
> I am thinking about something like a two-character string
> "[3-byte-upper-case-character-1][2-byte-upper-case-character-2]"
> where its lower-cased counterpart is
> "[2-byte-lower-case-character-1][3-byte-lower-case-character-2]"
>
> Something like "[AAA][BB]" versus "[aa][bbb]" where
> [AAA] is a 3-byte upper-case character where
> [aa] is its 2-byte lower-case counterpart and
> [BB] is a 2-byte upper-case character where
> [bbb] is its 3-byte lower-case counterpart.

Nice catch.
Thank you for reporting that.

> Do such strings, or similar ones, actually exist?

I'll bet it's possible.
If someone comes up with an example, please let us know.
All it takes is a lower case character (in a UTF-8 locale) that
is longer than its upper case companion.  Then put that upper
case character on a line with the Turkish I-with-dot, and run grep -i
to select that line.
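
One rough way to hunt for candidates is to scan every code point and compare the UTF-8 length of each character with that of its lowercase form. The sketch below does that in Python. It relies on Python's built-in Unicode tables rather than on any C library's locale data, so whether a particular locale's towlower agrees with its results is an assumption, and the function name is made up for illustration.

```python
def length_changing_lowercase():
    """Return (grows, shrinks): lists of (upper, lower) pairs whose
    UTF-8 encodings differ in length."""
    grows, shrinks = [], []
    for cp in range(0x110000):
        ch = chr(cp)
        low = ch.lower()
        # Skip characters without a distinct lowercase form (this also
        # skips surrogates) and mappings that expand to several code
        # points, which a one-to-one conversion could not produce.
        if low == ch or len(low) != 1:
            continue
        n_up = len(ch.encode("utf-8"))
        n_low = len(low.encode("utf-8"))
        if n_low > n_up:
            grows.append((ch, low))
        elif n_low < n_up:
            shrinks.append((ch, low))
    return grows, shrinks


if __name__ == "__main__":
    grows, shrinks = length_changing_lowercase()
    print("lowercase is longer:")
    for up, low in grows:
        print(f"  U+{ord(up):04X} -> U+{ord(low):04X}")
    print("lowercase is shorter:", len(shrinks), "characters")
```

With the Unicode data in current Python releases, the "longer" list includes U+023A (2 bytes), whose lowercase U+2C65 takes 3 bytes, and the "shorter" list includes U+2C6F (3 bytes), whose lowercase U+0250 takes 2. Python lowercases the Turkish I-with-dot to two code points, so the single-code-point filter leaves it out.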

> If yes could such kind of strings still cause errors?

Yes.
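
To make the failure concrete, here is a minimal sketch of a same-length case and of an offset map that copes with it. It is not grep's code: Python's str.lower stands in for mbtolower, the names lower_with_map and map_match are hypothetical, and the map stores the original byte offset for each lowered byte, which is a variant of the per-byte length vector M proposed above. It also assumes that matches begin and end on character boundaries of the lowered buffer.

```python
def lower_with_map(text):
    """Lowercase text one character at a time.

    Returns (lowered, orig_off), where lowered is the UTF-8 encoding of
    the lowercased text and orig_off[k] is the byte offset, in the
    original UTF-8 buffer, of the character that produced lowered byte k.
    A final sentinel entry holds the original buffer's total length.
    """
    lowered = bytearray()
    orig_off = []
    pos = 0
    for ch in text:
        low = ch.lower().encode("utf-8")
        lowered += low
        orig_off += [pos] * len(low)
        pos += len(ch.encode("utf-8"))
    orig_off.append(pos)  # sentinel: one past the last original byte
    return bytes(lowered), orig_off


def map_match(orig_off, start, length):
    """Translate a (start, length) match in the lowered buffer into the
    equivalent (start, length) in the original buffer."""
    o_start = orig_off[start]
    o_end = orig_off[start + length]
    return o_start, o_end - o_start


if __name__ == "__main__":
    # U+2C6F is 3 bytes and lowercases to the 2-byte U+0250;
    # U+023A is 2 bytes and lowercases to the 3-byte U+2C65.
    text = "\u2c6f\u023a\n"
    original = text.encode("utf-8")
    lowered, orig_off = lower_with_map(text)
    print(len(original) == len(lowered))  # total lengths agree

    needle = "\u2c65".encode("utf-8")
    start = lowered.find(needle)
    # Using (start, len(needle)) directly on the original buffer would
    # slice through the middle of the first character.
    o_start, o_len = map_match(orig_off, start, len(needle))
    print(original[o_start:o_start + o_len].decode("utf-8") == "\u023a")
```

Because the two buffers have the same total length here, a check that builds the map only when lengths differ would skip it, even though the second character starts at byte 3 in the original and at byte 2 in the lowered copy.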


