bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Bracket expressions with character ranges are slow


From: Seth David Schoen
Subject: Re: Bracket expressions with character ranges are slow
Date: Sun, 8 May 2011 15:58:26 -0700
User-agent: Mutt/1.5.20 (2009-06-14)

Jim Meyering writes:

> Thanks for the detailed report.
> I presume you are using a multi-byte locale like en_US.utf8.
> Note that A-Z and ABCDEFGHIJKLMNOPQRSTUVWXYZ are not equivalent.
> The A-Z range may contain many other characters, including lower case letters.
> If your search strings and data are all single-byte,
> you may prefer to use a single-byte locale.
> If you set LC_CTYPE=C in your environment, all of the above
> will run quickly.
> 
> That would avoid the inherent expense of using a UTF-8 locale.
> When you use the C locale, grep can take advantage of the
> simpler locale and works more like you would expect:

Thanks, that's definitely the source of the problem.  I appreciate
the explanation.  I did some more tests with this and found that
searches with bracket expressions in my UTF-8 locale are slow when
the elements inside the brackets contain both a single-byte character
and a multi-byte character.  So [ab], [üçå], [美国], and [ł天] are all
fast, but [人a] and [aö] are quite slow.

Maybe I need to think more about how UTF-8 works, but I don't quite
see why these bracket expressions need to be as slow as they are.
In particular, egrep REs using | are quite fast even when the number
of bytes in the expressions on either side of the | is different, like

egrep '(mew|meow)'

If I manually rewrite the bracket expressions to use the vertical pipe,
they get fast again even if the number of bytes in the characters
varies:

address@hidden:~$ time egrep '[人a]' /usr/share/dict/words > /dev/null

real    1m7.780s
user    1m7.164s
sys     0m0.028s
address@hidden:~$ time egrep '(人|a)' /usr/share/dict/words > /dev/null

real    0m0.048s
user    0m0.024s
sys     0m0.000s

The same difference appears for [öa] and (ö|a).  (It's true that
my /usr/share/dict/words doesn't contain 人, but I wrote a program
to add 人 and a few other multibyte characters to random positions
within 1/10 of the words in /usr/share/dict/words, and egrep's
behavior seemed correct and fast when running over that file too.)

I think [人a] and (人|a) specify equivalent patterns, so I don't
see why one of them has to be much slower than the other.  Wouldn't
it be possible to rewrite the format into the latter?

-- 
Seth David Schoen <address@hidden> | Qué empresa fácil no pensar en
     http://www.loyalty.org/~schoen/   | un tigre, reflexioné.
     http://vitanuova.loyalty.org/     |            -- Borges, El Zahir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]