bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Bracket expressions with character ranges are slow


From: Seth David Schoen
Subject: Re: Bracket expressions with character ranges are slow
Date: Wed, 18 May 2011 10:44:27 -0700
User-agent: Mutt/1.5.20 (2009-06-14)

Paolo Bonzini writes:

> On 05/09/2011 12:58 AM, Seth David Schoen wrote:
> >Thanks, that's definitely the source of the problem.  I appreciate
> >the explanation.  I did some more tests with this and found that
> >searches with bracket expressions in my UTF-8 locale are slow when
> >the elements inside the brackets contain both a single-byte character
> >and a multi-byte character.  So [ab], [üçå], [美国], and [ł天] are all
> >fast, but [人a] and [aö] are quite slow.
> >
> >Maybe I need to think more about how UTF-8 works, but I don't quite
> >see why these bracket expressions need to be as slow as they are.
> 
> You are correct that these cases (unlike ranges) can be optimized.

Suppose grep had a preprocessor that converted any bracket
expression containing elements of different byte sizes, whether
[美国a] or a range not all of whose characters are a single byte,
into a parenthesized alternation like (美|国|a).  Would this use
more memory, constituting a space-for-time tradeoff?  If not, is
there some other reason not to do this?  Is there some other
case in which matching the alternation becomes less efficient than
matching the bracket expression?

I realize this is only potentially possible for egrep, at least at
the surface level of rewriting the regular expression.

-- 
Seth David Schoen <address@hidden> | Qué empresa fácil no pensar en
     http://www.loyalty.org/~schoen/   | un tigre, reflexioné.
     http://vitanuova.loyalty.org/     |            -- Borges, El Zahir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]