[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [bug #13859] matching '([A]|[B]){2}' in different locales
From: |
Tim Waugh |
Subject: |
Re: [bug #13859] matching '([A]|[B]){2}' in different locales |
Date: |
Thu, 21 Jul 2005 10:16:40 +0100 |
User-agent: |
Mutt/1.4.2.1i |
On Wed, Jul 20, 2005 at 05:51:14PM -0400, Charles Levert wrote:
> Does the answer depend on the chosen locale?
> Is there still a justification for grep's DFA?
The answer may well depend on the locale, and there may indeed still
be justification for grep's own DFA. Here are the results of some
timings I did a while ago and posted to the old grep-devel list:
Date: Fri, 5 Nov 2004 18:14:04 +0000
Message-ID: <address@hidden>
Subject: GREP_NO_DFA environment variable
From: Tim Waugh <address@hidden>
To: grep-devel-list <address@hidden>
--hi1pl0KEDnHSGXJ/
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Hi,
Following Aharon's lead I've made a patch for grep to make it avoid
using dfaexec when the GREP_NO_DFA environment variable is set. This
is to facilitate more accurate benchmarking of the effect of DFA on a
variety of searches.
I've tried some tests, and have drawn a conclusion: scroll down for
my suggestion as to how to make grep fast in all cases.
C locale
--------
For some tests I find no measureable advantage to DFA:
$ export LC_CTYPE=C
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep foo input) 2>&1 | grep user
user 0m0.126s
$ (export GREP_NO_DFA=1 ; time ./grep foo input) 2>&1 | grep user
user 0m0.132s
(The results for each test varied by over 0.1s.)
For other tests, the DFA seemed to offer a significant advantage:
$ export LC_CTYPE=C
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep 0.3 input) 2>&1 | grep user
user 0m0.571s
$ (export GREP_NO_DFA=1; time ./grep 0.3 input) 2>&1 | grep user
user 0m9.342s
$ (time ./grep -v $ input) 2>&1 | grep user
user 0m2.619s
$ (export GREP_NO_DFA=1; time ./grep -v $ input) 2>&1 | grep user
user 0m11.937s
For these three types of search ("foo", "0.3", and "$"), DFA
definitely seems to be a win -- in the C locale.
UTF-8 locales
-------------
When using UTF-8 it is an entirely different matter, to put it
mildly. Although searches in which the pattern is rare do not show
much difference in running times:
$ export LC_CTYPE=en_GB.UTF-8
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep foo input) 2>&1 | grep user
user 0m0.132s
$ (export GREP_NO_DFA=1 ; time ./grep foo input) 2>&1 | grep user
user 0m0.132s
searches with common patterns are adversely affected by DFA:
$ export LC_CTYPE=en_GB.UTF-8
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep 0.3 input) 2>&1 | grep user
user 0m15.713s
$ (export GREP_NO_DFA=1 ; time ./grep 0.3 input) 2>&1 | grep user
user 0m12.542s
Sometimes wildly:
$ export LC_CTYPE=en_GB.UTF-8
$ perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
$ (time ./grep -v $ input) 2>&1 | grep user
user 40m4.108s
$ (export GREP_NO_DFA=1 ; time ./grep -v $ input) 2>&1 | grep user
user 0m11.958s
CONCLUSION
==========
My suggestion is: let's turn off DFA when MB_CUR_MAX > 1. While it
speeds up ASCII searching by quite some measure, it slows down UTF-8
searching (for example) by very much more for no appreciable gain.
The source code used for these tests is the Fedora Core RPM
grep-2.5.1-35, available at:
ftp://people.redhat.com/twaugh/tmp/grep/
Tim.
*/
--hi1pl0KEDnHSGXJ/
Content-Type: application/pgp-signature
Content-Disposition: inline
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
iD8DBQFBi8LsHU/d4jnpWe0RAgGkAJ4tv95XkTVISby0HZpLV5MXEWEvPACaAqUg
US0gOeHnl0d/Q0rBeNxgy9A=
=tzH1
-----END PGP SIGNATURE-----
--hi1pl0KEDnHSGXJ/--