[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x)
From: |
Jim Meyering |
Subject: |
bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales |
Date: |
Mon, 23 Dec 2013 15:54:27 -0800 |
On Mon, Dec 23, 2013 at 3:30 PM, Eric Blake <address@hidden> wrote:
> On 12/23/2013 04:12 PM, Jim Meyering wrote:
>> Did you miss the "isascii" check in the new trivial_case_convert function?
>
> No. But even with that check in place:
>
>> If you can describe circumstances in which the new patch malfunctions,
>> please do,
>> but everything you wrote seems to rely on a false assumption.
>
> No, it's a quite real complaint - your patch is broken for tr_TR.
>
>> E.g., your turkish-I example works fine with my patch.
>
> isascii('i') is true, but converting 'i' to '[iI]' is incorrect in the
> tr_TR locale. Rather, the conversion must be to '[iİ]'; similarly, 'I'
> would be translated to '[Iı]'. Neither of those conversions fit in 4
> bytes (since dotted-capital-I and dotless-lower-i are both multi-byte
> characters).
>
> Need help easily finding those characters on a non-Turkish keyboard? I
> used:
> $ echo iI | LC_ALL=tr_TR.UTF-8 sed 's/\(.\)\(.\)/\U\1\L\2/'
>
> At any rate, prior to your patch, lower dotless i in the buffer gives an
> insensitive match to upper dotless I in the pattern:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 grep -i I || echo no match
> ı
>
> After your patch:
>
> $ echo ı | LC_ALL=tr_TR.UTF-8 src/grep -i I || echo no match
> no match
>
> Oops, you failed to match lower dotless i insensitively against upper
> dotless I, because upper dotless I is ascii, but you incorrectly
> converted it into the wrong pattern.
Thanks for dotting those 'i's. While there is no risk of buffer
overrun, there would definitely be a problem with the tr_TR locale.
I will resolve it by removing the isascii check and performing
multibyte case conversion to form each [cC] pair. Of course,
that will mean removing the "4 * byte-length-of-search-string"
buffer size limitation.
I will also add tests based on your examples.