[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Bracket expressions with character ranges are slow
From: |
Seth David Schoen |
Subject: |
Re: Bracket expressions with character ranges are slow |
Date: |
Sun, 8 May 2011 15:58:26 -0700 |
User-agent: |
Mutt/1.5.20 (2009-06-14) |
Jim Meyering writes:
> Thanks for the detailed report.
> I presume you are using a multi-byte locale like en_US.utf8.
> Note that A-Z and ABCDEFGHIJKLMNOPQRSTUVWXYZ are not equivalent.
> The A-Z range may contain many other characters, including lower case letters.
> If your search strings and data are all single-byte,
> you may prefer to use a single-byte locale.
> If you set LC_CTYPE=C in your environment, all of the above
> will run quickly.
>
> That would avoid the inherent expense of using a UTF-8 locale.
> When you use the C locale, grep can take advantage of the
> simpler locale and works more like you would expect:
Thanks, that's definitely the source of the problem. I appreciate
the explanation. I did some more tests with this and found that
searches with bracket expressions in my UTF-8 locale are slow when
the elements inside the brackets contain both a single-byte character
and a multi-byte character. So [ab], [üçå], [美国], and [ł天] are all
fast, but [人a] and [aö] are quite slow.
Maybe I need to think more about how UTF-8 works, but I don't quite
see why these bracket expressions need to be as slow as they are.
In particular, egrep REs using | are quite fast even when the number
of bytes in the expressions on either side of the | is different, like
egrep '(mew|meow)'
If I manually rewrite the bracket expressions to use the vertical pipe,
they get fast again even if the number of bytes in the characters
varies:
address@hidden:~$ time egrep '[人a]' /usr/share/dict/words > /dev/null
real 1m7.780s
user 1m7.164s
sys 0m0.028s
address@hidden:~$ time egrep '(人|a)' /usr/share/dict/words > /dev/null
real 0m0.048s
user 0m0.024s
sys 0m0.000s
The same difference appears for [öa] and (ö|a). (It's true that
my /usr/share/dict/words doesn't contain 人, but I wrote a program
to add 人 and a few other multibyte characters to random positions
within 1/10 of the words in /usr/share/dict/words, and egrep's
behavior seemed correct and fast when running over that file too.)
I think [人a] and (人|a) specify equivalent patterns, so I don't
see why one of them has to be much slower than the other. Wouldn't
it be possible to rewrite the format into the latter?
--
Seth David Schoen <address@hidden> | Qué empresa fácil no pensar en
http://www.loyalty.org/~schoen/ | un tigre, reflexioné.
http://vitanuova.loyalty.org/ | -- Borges, El Zahir