bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x)
From: Pádraig Brady
Subject: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Sat, 11 Jan 2014 14:15:58 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <address@hidden> wrote:
>>>> I wonder might this faster path be restricted to a safer but very common
>>>> input subset of:
>>>>
>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>
>>> That sounds like a good approach.
>>> Now I need another test case, to demonstrate that the current code can
>>> cause trouble.
>>
>> Hmm... after thinking about this for a while and actually trying to
>> break the current code (did not find a way to demonstrate a regression),
>> I have concluded that the current approach is no worse than the prior
>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>
>> That's not to say that it's perfect, of course.
>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>
>> printf '\x6A\xCC\x8C\xCC\xA3\n' | src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> but this does not, yet probably should:
>>
>> printf '\xC7\xB0\xCC\xA3\n' | src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> Can you see a way to demonstrate a regression?
>
> Oh right, it doesn't handle these cases already.
> Fair enough I don't see a regression then.
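For reference, the restricted fast path suggested earlier in the thread
could be sketched roughly as below. This is only an illustration, not
grep's actual code: fast_path_ok is a hypothetical name, and mb_cur_max
and in_utf8 are passed as parameters (standing in for MB_CUR_MAX and a
"locale is UTF-8" flag) so that the check stays a pure function.

```c
#include <stdbool.h>

/* Hypothetical helper, not taken from grep: report whether byte C may
   take the fast case-insensitive path under the proposed restriction
   (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80)).  */
bool
fast_path_ok (int mb_cur_max, bool in_utf8, unsigned char c)
{
  /* Unibyte locale: every byte is a complete character.  */
  if (mb_cur_max == 1)
    return true;

  /* UTF-8 locale: bytes below 0x80 are ASCII and never appear inside
     a multibyte sequence, so they can be case-mapped byte by byte.  */
  return in_utf8 && c < 0x80;
}
```

With this check, a byte such as 0xC7 in a UTF-8 locale would fall back
to the slower multibyte code, as would any byte in a non-UTF-8
multibyte locale.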
This is also a good summary of stuff to consider with case:
http://www.unicode.org/faq/casemap_charprop.html
So picking another case situation from there:
"in the Greek script, capital sigma (U+03A3) is the uppercase form of both
the regular (U+03C3) and final (U+03C2) lowercase sigma."
One can see that sed handles this:
$ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
ςσΣΣ
$ printf '\u03A3\n' | sed 's/.*/&\L&/'
Σσ
Though I was surprised that grep (2.14) didn't match any combination of these:
$ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
$ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
$ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"
Not a regression of course.
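To make the expected behaviour concrete, here is a minimal sketch of
comparing code points after simple case folding, restricted to the
three sigma code points above. It is an assumption-laden illustration
rather than anything grep or sed does internally: fold_sigma and
equal_ignoring_case are hypothetical names, and a real implementation
would need the complete Unicode case-folding data.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: fold the sigma variants to one representative.
   Unicode simple case folding maps both U+03A3 and U+03C2 to U+03C3;
   every other code point is returned unchanged in this sketch.  */
uint32_t
fold_sigma (uint32_t cp)
{
  switch (cp)
    {
    case 0x03A3:        /* GREEK CAPITAL LETTER SIGMA */
    case 0x03C2:        /* GREEK SMALL LETTER FINAL SIGMA */
      return 0x03C3;    /* GREEK SMALL LETTER SIGMA */
    default:
      return cp;
    }
}

/* Two code points match case-insensitively if their folded forms
   are identical.  */
bool
equal_ignoring_case (uint32_t a, uint32_t b)
{
  return fold_sigma (a) == fold_sigma (b);
}
```

Under this scheme all three sigmas compare equal to each other, which
is the result one might expect from the -i searches above.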
cheers,
Pádraig.