bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.


From: Paolo Bonzini
Subject: Re: [patch #6899] Speed-up for searching in multibyte and ignore-icase.
Date: Mon, 8 Mar 2010 21:51:57 +0100

>> Especially now that there is a good DFA-based matcher in glibc,
>
> Has anybody proven this?  There was a period of time where I removed
> dfa from gawk because "it wasn't needed anymore now that regex has a dfa
> matcher" but I had to reinstate it - the regex matcher was intolerably
> slower.  I added the GAWK_NO_DFA environment variable so that it'd be
> possible to test using just the regex matcher and I think grep now has
> a similar variable.

The main problems with the glibc DFA matcher are because it has to
track subexpression boundaries.  So it won't be as fast as dfa.c in
the general case.  Never.

On the other hand it developed many optimizations for UTF-8 that right
now make it faster than dfa.c in some common cases (while still being
slower on others).  On glibc systems it is also the only matcher that
can correctly handle bracket expressions (due to the shortsightedness
of the POSIX committee, in my not-so-humble opinion).

With respect to the former, the way to go is to add these
optimizations to dfa.c, and to eliminate other obvious performance
problems in grep and dfa.c's multibyte paths; I do have some patches
for that, but like my RFC patch I'll wait for you and Jim to finish
the sync.  Due to the issue with brackets, however we'll have to live
either with dfa.c being only 99% correct, or with it punting to
glibc's regex matcher.  Finding a balance is hard also because on
non-glibc platform we then lose speed _but not gain precision_.

I do think that Norihirio's DFA patch goes a bit too far, but it has
the merit of making the fastest possible dfa.c _as long as it doesn't
punt to glibc regex_.  In some sense, it provides a baseline for
comparison.  Does this make sense?

Paolo




reply via email to

[Prev in Thread] Current Thread [Next in Thread]