[bug-grep] UTF-8 performance: progress report
From: Tim Waugh
Subject: [bug-grep] UTF-8 performance: progress report
Date: Wed, 15 Dec 2004 13:29:55 +0000
User-agent: Mutt/1.4.1i
Hi,
I have been working on improving grep's performance in the UTF-8
encoding, and thought I'd send a progress report.
Below are some simple benchmarking results comparing two binaries:
* grep-2.5.1 as released, configured with --without-included-regex
* grep-2.5.1-31.3, built for Fedora Core 3. Several patches are
  applied, and it is configured with --without-included-regex. Among
  the applied patches:
  o dfa-optional: this makes the use of the DFA conditional on whether
    the current locale's character encoding is a multibyte one. For
    UTF-8, the DFA is turned off. I posted results of this
    improvement early last month.
  o egf-speedup: this considerably reduces the multibyte processing
    (mbrtowc etc.) by only using it when necessary. For the special
    case of UTF-8, without the built-in DFA, grep itself never needs
    it; of course the system re_search() function still has to be
    aware of multibyte handling.
Both run on the same machine, and the installed C library is
glibc-2.3.3-90 from the Fedora development repository.
Here is the simple test script I used:
==>
perl -e '$a="0123456789"x7;$a.="\n";print $a x 400000' >input
echo " ASCII:" > a
(export LANG=C; time $GREP 'foo' input) 2>&1 | grep user >> a
(export LANG=C; time $GREP '0.3' input) 2>&1 | grep user >> a
(export LANG=C; time $GREP -v '$' input) 2>&1 | grep user >> a
(export LANG=C; time $GREP -v '90123456789' input) 2>&1 | grep user >> a
echo " UTF-8:" > b
(export LANG=en_GB.UTF-8; time $GREP 'foo' input) 2>&1 | grep user >> b
(export LANG=en_GB.UTF-8; time $GREP '0.3' input) 2>&1 | grep user >> b
(export LANG=en_GB.UTF-8; time $GREP -v '$' input) 2>&1 | grep user >> b
(export LANG=en_GB.UTF-8; time $GREP -v '90123456789' input) 2>&1 | grep user >> b
paste <(expand a) <(expand b)
<==
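As a possible tidier variant of the script above, the eight near-identical
lines could be folded into two small functions. This is only a sketch,
assuming bash (for the `time` keyword and TIMEFORMAT). The names bench and
run_suite are made up for illustration, the locale is set per command rather
than exported in a subshell, and the times come out in plain seconds rather
than the 0m0.000s form shown in the results.

```shell
#!/bin/bash
# Hypothetical helper: time one grep run under the given locale and
# print only the user CPU time, e.g. "user 0.123".
bench() {
    local locale=$1; shift
    local TIMEFORMAT='user %3U'   # format used by bash's time keyword
    # grep's own output and errors are discarded; the timing line is
    # written by the shell to stderr, which the outer 2>&1 captures.
    { time LANG=$locale "${GREP:-grep}" "$@" >/dev/null 2>&1; } 2>&1
}

# Hypothetical driver: run the same four tests as the script above,
# first in the C locale and then in en_GB.UTF-8.
run_suite() {
    local input=$1 locale
    for locale in C en_GB.UTF-8; do
        echo " $locale:"
        bench "$locale" 'foo' "$input"
        bench "$locale" '0.3' "$input"
        bench "$locale" -v '$' "$input"
        bench "$locale" -v '90123456789' "$input"
    done
}
```

With the input file generated as above, `GREP=/path/to/grep run_suite input`
would print the four user times for each locale in turn, one block after the
other instead of side by side.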
First the results for grep-2.5.1 as released:
 ASCII:                 UTF-8:
user    0m0.125s        user    0m9.460s
user    0m0.554s        user    0m25.188s
user    0m2.464s        user    39m26.313s
user    0m0.293s        user    35m55.760s
Now the much-improved results for grep-2.5.1-31.3:
 ASCII:                 UTF-8:
user    0m0.123s        user    0m0.126s
user    0m0.564s        user    0m13.152s
user    0m2.500s        user    0m12.179s
user    0m0.293s        user    0m0.291s
For the last test, the UTF-8 run appears marginally faster than the
ASCII run (0.291s against 0.293s). This shows that for that pattern,
whatever overhead UTF-8 incurs is lost in the noise.
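To put a number on the gap between the two columns, the time strings can be
converted to seconds and divided. The helpers below are a rough sketch, not
part of the test script: the names to_seconds and slowdown are hypothetical,
and they assume the XmY.YYYs format that bash's time prints by default.

```shell
#!/bin/bash
# Hypothetical helper: convert a time string such as "39m26.313s"
# into seconds with three decimal places.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, p, "m")          # p[1] = minutes, p[2] = seconds + "s"
        sub(/s$/, "", p[2])       # drop the trailing "s"
        printf "%.3f\n", p[1] * 60 + p[2]
    }'
}

# Hypothetical helper: how many times slower the second time is
# than the first, to one decimal place.
slowdown() {
    local a u
    a=$(to_seconds "$1")
    u=$(to_seconds "$2")
    awk -v a="$a" -v u="$u" 'BEGIN { printf "%.1f\n", u / a }'
}
```

For example, `slowdown 0m0.564s 0m13.152s` gives the UTF-8 to ASCII ratio
for the '0.3' pattern with the patched binary.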
You can see the patches that are applied in grep-2.5.1-31.3 here:
ftp://people.redhat.com/twaugh/tmp/grep/fc3/unpacked/
Tim.