bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x)


From: Eric Blake
Subject: bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales
Date: Mon, 23 Dec 2013 15:52:26 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

On 12/23/2013 03:39 PM, Jim Meyering wrote:
> FYI, here is a quick and clean/safe performance improvement for grep -i.
> I expect to push this commit right after the upcoming bug-fix release.
> Currently, this optimization is enabled when the search string is
> ASCII and contains neither of '\' (backslash) nor '['.  I expect to
> eliminate the latter two constraints in a follow-on commit including
> tests to exercise all of the corner cases.
> 

> +
> +  /* Worst case is that every byte of keys will be alpha,
> +     so every byte B will map to the sequence of 4 bytes [Bb].  */

Umm, is this always true?  Consider the UTF-8 Turkish locale, where
single-byte i has a multi-byte uppercase (and conversely the single-byte
I has a multi-byte lowercase) - that is, 'i' and 'I' are not case pairs.

> +      else
> +        {
> +          *p++ = '[';
> +          *p++ = *keys;
> +          *p++ = islower (*keys) ? toupper (*keys) : tolower (*keys);

This performs the ASCII-only toupper/tolower, rather than the proper
locale-sensitive case conversion, which probably makes this patch
misbehave for LC_ALL=tr_TR.UTF-8.

> 
> +  /* As currently implemented, case-insensitive matching is expensive in
> +     multi-byte locales because of a few outlier locales in which some
> +     characters change size when converted to upper or lower case.  To
> +     accommodate those, we revert to searching the input one line at a
> +     time, rather than using the much more efficient buffer search.
> +     However, if we have a plain ascii search string, /foo/, we can
> +     convert it to an equivalent case-insensitive /[fF][oO][oO]/, and thus
> +     avoid the expensive read-and-process-a-line-at-a-time requirement.
> +     Optimize-away the "-i" option, when possible, converting each
> +     candidate alpha, C, in the regexp to [Cc].  */

In other words, this comment describes the very flaw of the outlier
tr_TR locale that you are now violating with your optimization.

Without your patch, this is correct behavior:

$ echo i | LC_ALL=tr_TR.UTF-8 grep -i I
$ echo i | LC_ALL=tr_TR.UTF-8 grep -i i
i

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]