bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Jim Meyering
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Thu, 18 Sep 2014 12:36:57 -0700

On Thu, Sep 18, 2014 at 1:33 AM, Santiago Ruano Rincón
<address@hidden> wrote:
> On 17/09/14 at 23:00, Paul Eggert wrote:
>> I've installed all the patches mentioned so far.
>>
>
> I've successfully built the latest commit
> (f6de00f6cec3831b8f334de7dbd1b59115627457), but I don't see any
> performance boost. Rather the opposite.
>
> Comparing with Debian's grep 2.20-3, which includes your first patch to solve
> this -P issue, 0001-grep-P-invalid-utf8-non-matching.patch:
>
> grep -P asdf /usr/bin/*  12,42s user 0,12s system 99% cpu 12,545 total
> src/grep -P asdf /usr/bin/*  14,37s user 0,12s system 99% cpu 14,492 total
>
> Note that basic grep also slows down:
>
> grep asdf /usr/bin/*  0,22s user 0,16s system 99% cpu 0,382 total
> src/grep asdf /usr/bin/*  1,26s user 0,12s system 99% cpu 1,384 total

Thank you for running timing comparisons.

Once I verified that I had no large, sparse files in my grep working directory,
I ran the same test there (du -sh . reports 176M, du --app -sh . reports 139M)

The following shows a performance regression when searching files
like those in my grep working directory.
The new grep (v2.20-46-gf6de00f) takes 2.5x longer than 2.20.14.
This is with a hot cache (best of several runs) on an
Intel(R) Xeon(R) CPU E5-2660, compiled with gcc-5.x.

$ diff -u <(env time grep -r asdf . 2>&1) <(PATH=src:$PATH env time grep -r asdf . 2>&1)
--- /proc/self/fd/11    2014-09-18 12:07:43.169721947 -0700
+++ /proc/self/fd/12    2014-09-18 12:07:43.169721947 -0700
@@ -1,3 +1,3 @@
 ./src/grep.c:               printf 'asdfqwerzxcv\rASDF\tZXCV\n'
-0.08user 0.10system 0:00.18elapsed 100%CPU (0avgtext+0avgdata 6256maxresident)k
-0inputs+0outputs (0major+670minor)pagefaults 0swaps
+0.40user 0.11system 0:00.51elapsed 99%CPU (0avgtext+0avgdata 5328maxresident)k
+0inputs+0outputs (0major+634minor)pagefaults 0swaps

It looks like most of the difference is the result of
commit cd36abd46c5e0768606979ea75a51732062f5624,
"grep: treat a file as binary if its prefix contains encoding errors",
with its new, locale-sensitive "is_binary" test. I saw the above timing
difference even with LC_ALL=C, so one quick fix would be to skip the use
of mbrlen when possible.
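
For illustration only, here is a rough sketch of what "skip the use of
mbrlen when possible" might look like. It is not the code from commit
cd36abd4 or from any later fix. The function name buf_looks_binary is
hypothetical, and the sketch assumes that a single-byte locale
(MB_CUR_MAX == 1) has no encoding errors to detect and that the multibyte
encoding is ASCII-compatible and stateless, as UTF-8 is.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not grep's is_binary): return true if
   BUF[0..SIZE) contains a NUL byte or, in a multibyte locale,
   an invalid multibyte sequence.  */
bool
buf_looks_binary (char const *buf, size_t size)
{
  /* A NUL byte marks the buffer as binary in any locale.  */
  if (memchr (buf, '\0', size))
    return true;

  /* Assumption: in a single-byte locale such as LC_ALL=C there are
     no encoding errors to look for, so mbrlen is never called.  */
  if (MB_CUR_MAX == 1)
    return false;

  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);
  char const *p = buf;
  char const *end = buf + size;

  while (p < end)
    {
      /* Assumption: the encoding is ASCII-compatible and stateless,
         so a byte below 0x80 at a character boundary is a complete
         character and needs no mbrlen call.  */
      if ((unsigned char) *p < 0x80)
        {
          p++;
          continue;
        }

      size_t n = mbrlen (p, (size_t) (end - p), &mbs);
      if (n == (size_t) -1)
        return true;            /* invalid sequence */
      if (n == (size_t) -2)
        break;                  /* truncated at the end of the buffer */
      p += n ? n : 1;
    }

  return false;
}
```

With both shortcuts, mbrlen is reached only for non-ASCII bytes in a
multibyte locale, so mostly-ASCII input avoids nearly all of the calls.
A caller would need setlocale (LC_ALL, "") first for MB_CUR_MAX to
reflect the user's locale.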




