Re: Bracket expressions with character ranges are slow

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Bracket expressions with character ranges are slow

From:	Seth David Schoen
Subject:	Re: Bracket expressions with character ranges are slow
Date:	Sun, 8 May 2011 15:58:26 -0700
User-agent:	Mutt/1.5.20 (2009-06-14)

Jim Meyering writes:

> Thanks for the detailed report.
> I presume you are using a multi-byte locale like en_US.utf8.
> Note that A-Z and ABCDEFGHIJKLMNOPQRSTUVWXYZ are not equivalent.
> The A-Z range may contain many other characters, including lower case letters.
> If your search strings and data are all single-byte,
> you may prefer to use a single-byte locale.
> If you set LC_CTYPE=C in your environment, all of the above
> will run quickly.
> 
> That would avoid the inherent expense of using a UTF-8 locale.
> When you use the C locale, grep can take advantage of the
> simpler locale and works more like you would expect:

Thanks, that's definitely the source of the problem.  I appreciate
the explanation.  I did some more tests with this and found that
searches with bracket expressions in my UTF-8 locale are slow when
the elements inside the brackets contain both a single-byte character
and a multi-byte character.  So [ab], [üçå], [美国], and [ł天] are all
fast, but [人a] and [aö] are quite slow.

Maybe I need to think more about how UTF-8 works, but I don't quite
see why these bracket expressions need to be as slow as they are.
In particular, egrep REs using | are quite fast even when the number
of bytes in the expressions on either side of the | is different, like

egrep '(mew|meow)'

If I manually rewrite the bracket expressions to use the vertical pipe,
they get fast again even if the number of bytes in the characters
varies:

address@hidden:~$ time egrep '[人a]' /usr/share/dict/words > /dev/null

real    1m7.780s
user    1m7.164s
sys     0m0.028s
address@hidden:~$ time egrep '(人|a)' /usr/share/dict/words > /dev/null

real    0m0.048s
user    0m0.024s
sys     0m0.000s

The same difference appears for [öa] and (ö|a).  (It's true that
my /usr/share/dict/words doesn't contain 人, but I wrote a program
to add 人 and a few other multibyte characters to random positions
within 1/10 of the words in /usr/share/dict/words, and egrep's
behavior seemed correct and fast when running over that file too.)

I think [人a] and (人|a) specify equivalent patterns, so I don't
see why one of them has to be much slower than the other.  Wouldn't
it be possible to rewrite the format into the latter?

-- 
Seth David Schoen <address@hidden> | Qué empresa fácil no pensar en
     http://www.loyalty.org/~schoen/   | un tigre, reflexioné.
     http://vitanuova.loyalty.org/     |            -- Borges, El Zahir

[Prev in Thread]

Current Thread

[Next in Thread]

Bracket expressions with character ranges are slow, Seth David Schoen, 2011/05/08
- Re: Bracket expressions with character ranges are slow, Seth David Schoen, 2011/05/08
- Re: Bracket expressions with character ranges are slow, Jim Meyering, 2011/05/08
  - Re: Bracket expressions with character ranges are slow, Seth David Schoen <=
    - Re: Bracket expressions with character ranges are slow, Paolo Bonzini, 2011/05/18
    - Re: Bracket expressions with character ranges are slow, Seth David Schoen, 2011/05/18
    - Re: Bracket expressions with character ranges are slow, Paolo Bonzini, 2011/05/18

Prev by Date: Re: [bug #33198] Incorrect bracket expression when parsing in ru_RU.KOI8-R (Russian locale)
Next by Date: [bug #33271] non-US characters and -o
Previous by thread: Re: Bracket expressions with character ranges are slow
Next by thread: Re: Bracket expressions with character ranges are slow
Index(es):
- Date
- Thread