bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x)

bug-grep

From:	Pádraig Brady
Subject:	bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date:	Sat, 11 Jan 2014 14:15:58 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <address@hidden> wrote:
>>>> I wonder might this faster path be restricted to a safer but very common 
>>>> input subset of:
>>>>
>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>
>>> That sounds like a good approach.
>>> Now I need another test case, to demonstrate that the current code can
>>> cause trouble.
>>
>> Hmm... after thinking about this for a while and actually trying to
>> break the current code (did not find a way to demonstrate a regression),
>> I have concluded that the current approach is no worse than the prior
>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>
>> That's not to say that it's perfect, of course.
>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>
>>     printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> but this does not, yet probably should:
>>
>>     printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> Can you see a way to demonstrate a regression?
> 
> Oh right, it doesn't handle these cases already.
> Fair enough I don't see a regression then.
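
To make the quoted example concrete, here is a small Python sketch (not grep code, and not a claim about how grep compares text). It assumes a match is defined by Unicode full case folding, which is what Python's str.casefold() implements. U+01F0 (the \xC7\xB0 bytes) folds to "j" plus COMBINING CARON, so the two inputs become identical, which a one-to-one mapping such as towlower() cannot express because the length changes. The helper name caseless_equal is made up for this illustration.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode full case folding."""
    return a.casefold() == b.casefold()


# Pattern from the quoted commands:
# j + COMBINING CARON + COMBINING DOT BELOW
pattern = b"\x6A\xCC\x8C\xCC\xA3".decode("utf-8")

# Input from the second command:
# U+01F0 (precomposed j with caron) + COMBINING DOT BELOW
precomposed = b"\xC7\xB0\xCC\xA3".decode("utf-8")

# Identical strings compare equal, as in the first quoted command.
print(caseless_equal(pattern, pattern))

# The precomposed form compares equal only because folding expands it.
print(caseless_equal(precomposed, pattern))

# Folding is not length-preserving here: one code point becomes two.
print(len(precomposed), len(precomposed.casefold()))
```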

This is also a good summary of things to consider with case mapping:
http://www.unicode.org/faq/casemap_charprop.html

So picking another case situation from there:
  "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
   the regular (U+03C3) and final (U+03C2) lowercase sigma."

One can see that sed handles this:
  $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
  ςσΣΣ
  $ printf '\u03A3\n' | sed 's/.*/&\L&/'
  Σσ

Though I was surprised that grep (2.14) didn't match any combination of these:
  $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf '\u03A3')"
  $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C2')"
  $ printf '\u03A3\n' | grep -Fi "$(printf '\u03C3')"

Not a regression of course.
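
As a rough cross-check of what a match could look like, the following Python sketch (again not grep's implementation) folds both pattern and input with str.casefold() before a fixed-string search. Under that assumption all three sigma forms are equivalent, with U+03C3 as the regular small sigma and U+03C2 as the final form. The function name contains_caseless is hypothetical.

```python
CAPITAL_SIGMA = "\u03A3"  # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03C3"    # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03C2"    # GREEK SMALL LETTER FINAL SIGMA

# Both lowercase forms share a single uppercase form.
print(SMALL_SIGMA.upper() == CAPITAL_SIGMA)
print(FINAL_SIGMA.upper() == CAPITAL_SIGMA)

# Case folding maps all three characters to the same one.
folded = {ch.casefold() for ch in (CAPITAL_SIGMA, SMALL_SIGMA, FINAL_SIGMA)}
print(folded == {SMALL_SIGMA})


def contains_caseless(haystack: str, needle: str) -> bool:
    """Fixed-string, case-insensitive search via full case folding."""
    return needle.casefold() in haystack.casefold()


# The three input/pattern pairs from the grep commands above.
print(contains_caseless(FINAL_SIGMA + SMALL_SIGMA, CAPITAL_SIGMA))
print(contains_caseless(CAPITAL_SIGMA, FINAL_SIGMA))
print(contains_caseless(CAPITAL_SIGMA, SMALL_SIGMA))
```

Folding both sides first sidesteps the question of which lowercase sigma to pick, at the cost of also treating the regular and final forms as equal to each other.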

cheers,
Pádraig.
