bug-gnu-utils

grep is horribly slow in UTF-8 locales


From: Markus Kuhn
Subject: grep is horribly slow in UTF-8 locales
Date: Fri, 07 Nov 2003 12:52:44 +0000

On Red Hat 9:

$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (157major+34minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+24minor)pagefaults 0swaps

where test.txt is just http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
repeated 10 times.
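
For anyone who wants to reproduce the timings, the test file could be
built along these lines. This is only a sketch: it assumes that
UnicodeData.txt has already been downloaded to the current directory,
and repeat_file is a hypothetical helper name, not part of any tool
mentioned here.

```python
import shutil


def repeat_file(src_path: str, dst_path: str, times: int = 10) -> None:
    """Write `times` back-to-back copies of src_path into dst_path."""
    with open(dst_path, "wb") as dst:
        for _ in range(times):
            # Reopen the source for each pass so every copy starts at byte 0.
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, dst)


if __name__ == "__main__":
    # Assumed local file names; adjust to wherever the download was saved.
    repeat_file("UnicodeData.txt", "test.txt", times=10)
```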

It seems grep needs about 100 times more user CPU time in a UTF-8 locale
than in an ASCII locale (6.83 s versus 0.07 s), even where the search
string contains no regex metacharacters.

And fgrep is no better.

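There is technically no reason why grep should have to be any slower in
a UTF-8 locale than in a single-byte locale if the string does not
contain any regex metacharacters at all. In that case, UTF-8 can be
processed just like ASCII: the lead bytes and continuation bytes of
UTF-8 come from disjoint ranges, so a byte-wise match of a valid UTF-8
string can only start at a character boundary.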
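
The point about literal patterns can be illustrated with a small Python
sketch. It is not grep's code, and find_literal_utf8 is a hypothetical
helper name. It assumes that both strings are valid UTF-8 and that the
match is case-sensitive with no normalization. It searches the raw
bytes and only afterwards converts the byte offset into a character
index.

```python
def find_literal_utf8(haystack: str, needle: str) -> int:
    """Return the character index of the first match, or -1 if none.

    The search itself runs on raw UTF-8 bytes, with no decoding of
    multibyte characters.
    """
    hay = haystack.encode("utf-8")
    pat = needle.encode("utf-8")

    # Plain byte-wise substring search, exactly as for ASCII data.
    pos = hay.find(pat)
    if pos < 0:
        return -1

    # Continuation bytes have the bit pattern 10xxxxxx. Counting the
    # bytes before the match that are NOT continuation bytes gives the
    # number of characters before it.
    return sum(1 for b in hay[:pos] if (b & 0xC0) != 0x80)


if __name__ == "__main__":
    text = "αβγ XYZ δ"
    # Both lines report the same position.
    print(find_literal_utf8(text, "XYZ"))
    print(text.find("XYZ"))
```

The byte-level result agrees with a character-level search because the
first byte of the pattern is never a continuation byte, so a match
cannot begin in the middle of a multibyte character.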

In UTF-8 mode, grep is also much slower than the equivalent Perl:

$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (339major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (322major+45minor)pagefaults 0swaps

Any suggestions? It would be nice not to be penalized like this by grep
for using a UTF-8 locale by default.

Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain




