bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Bracket expressions with character ranges are slow


From: Jim Meyering
Subject: Re: Bracket expressions with character ranges are slow
Date: Sun, 08 May 2011 09:41:10 +0200

Seth David Schoen wrote:
> In grep 2.6.3 from Ubuntu Maverick, I noticed that patterns using
> a character range in a bracket expression are extraordinarily slow.
> E.g.
>
> address@hidden:~$ time grep '[A-Z]' /usr/share/dict/words > /dev/null
>
> real  0m26.819s
> user  0m26.134s
> sys   0m0.012s
> address@hidden:~$ time grep -v '[A-Z]' /usr/share/dict/words > /dev/null
>
> real  0m27.438s
> user  0m26.130s
> sys   0m0.008s
>
> (I first noticed this problem when trying to make a word list that
> excludes proper names for use in an anagram game.)
>
> What's really odd about this is that the same pattern written without a
> range, using an explicit list of characters, is dramatically faster:
>
> address@hidden:~$ time grep '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
> /usr/share/dict/words > /dev/null
>
> real  0m0.020s
> user  0m0.012s
> sys   0m0.004s
> address@hidden:~$ time grep -v '[ABCDEFGHIJKLMNOPQRSTUVWXYZ]'
> /usr/share/dict/words > /dev/null
>
> real  0m0.033s
> user  0m0.032s
> sys   0m0.000s
>
> That is, using the explicit list of characters in the backet
> expression was three orders of magnitude faster!  Using the
> character class [:upper:] is also fast:
>
> address@hidden:~$ time grep  '[:upper:]' /usr/share/dict/words > /dev/null
>
> real  0m0.026s
> user  0m0.024s
> sys   0m0.000s
>
> The disparity in the snapshot version released today, 2.7.43-ed23d,
> is much less extreme, but it's still present:
>
> address@hidden:~/grep-2.7.43-ed23d/src$ time ./grep '[A-Z]'
> /usr/share/dict/words > /dev/null
>
> real  0m0.417s
> user  0m0.372s
> sys   0m0.016s
> address@hidden:~/grep-2.7.43-ed23d/src$ time ./grep
> [ABCDEFGHIJKLMNOPQRSTUVWXYZ]' /usr/share/dict/words > /dev/null
>
> real  0m0.025s
> user  0m0.024s
> sys   0m0.000s
>
> Now using the character range is only one order of magnitude slower,
> instead of three.
>
> In grep 2.6.3, the effect seems to depend linearly on the number of
> characters in the range.  In 2.7.43-ed23d, there seems to be the
> same slowdown as a result of using any range, regardless of whether
> it contains dozens of characters or just three, two, or one:
>
> address@hidden:~/grep-2.7.43-ed23d/src$ time ./grep '[X-Z]'
> /usr/share/dict/words > /dev/null
>
> real  0m0.428s
> user  0m0.388s
> sys   0m0.012s
> address@hidden:~/grep-2.7.43-ed23d/src$ time ./grep '[XYZ]'
> /usr/share/dict/words > /dev/null
>
> real  0m0.032s
> user  0m0.012s
> sys   0m0.004s
>
> The problem affects egrep as well as grep.
>
> The time taken in each case seems to be proportional to the amount
> of input searched, so the difference is _not_ just in the initial
> setup.  I don't know anything about the implementation here, but
> it seems unnecessary for character ranges to be so inefficient,
> since a preprocessor could easily turn each range in a bracket
> expression into an explicit list of characters...

Thanks for the detailed report.
I presume you are using a multi-byte locale like en_US.utf8.
Note that A-Z and ABCDEFGHIJKLMNOPQRSTUVWXYZ are not equivalent.
The A-Z range may contain many other characters, including lower case letters.
If your search strings and data are all single-byte,
you may prefer to use a single-byte locale.
If you set LC_CTYPE=C in your environment, all of the above
will run quickly.

That would avoid the inherent expense of using a UTF-8 locale.
When you use the C locale, grep can take advantage of the
simpler locale and works more like you would expect:

  $ LC_CTYPE=C env time ./grep '[XYZ]' /usr/share/dict/words > /dev/null
  0.02user 0.00system 0:00.02elapsed 96%CPU (0avgtext+0avgdata 3120maxresident)k
  0inputs+0outputs (0major+227minor)pagefaults 0swaps
  $ LC_CTYPE=C env time ./grep '[X-Z]' /usr/share/dict/words > /dev/null
  0.02user 0.00system 0:00.02elapsed 92%CPU (0avgtext+0avgdata 3136maxresident)k
  0inputs+0outputs (0major+228minor)pagefaults 0swaps

Contrast that with the 4 seconds it takes when I'm using a
common multi-byte locale:

  $ LC_CTYPE=en_US.utf8 env time ./grep '[X-Z]' /usr/share/dict/words > 
/dev/null
  3.33user 0.77system 0:04.12elapsed 99%CPU (0avgtext+0avgdata 4208maxresident)k
  0inputs+0outputs (0major+353974minor)pagefaults 0swaps



reply via email to

[Prev in Thread] Current Thread [Next in Thread]