bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#19358: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multip


From: Jim Meyering
Subject: bug#19358: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words
Date: Wed, 5 Aug 2015 21:12:33 -0700

On Fri, Dec 12, 2014 at 7:58 AM, Norihiro Tanaka <address@hidden> wrote:
> Hi,
>
> Searching multiple fixed words, grep uses Commentz-Walter algorithm, but
> it is very slow for a worst case e.g. as following.  It is O(m*n).
>
>   - input: yes `printf %040d` | head -10000000
>   - word1: x0000000000000000000
>   - word2: x
>
> This change uses Aho-Corasick algorithm instead of Commentz-Walter
> algorithm to search multiple fixed words.  It uses a function to build
> tries which has been already defined for Commentz-Walter algorithm in
> kwset.c and which has been already high quality.
>
> I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5
> by this change.
>
> First without this change (best-of-5 trials):
>
>     find /usr/share/doc/ -type f |
>     LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words 
> >/dev/null
>         real 11.37      user 11.03      sys 0.24
>
> Next with this change (best-of-5 trials):
>
>     find /usr/share/doc/ -type f |
>     LC_ALL=C  time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words 
> >/dev/null
>         real 1.49       user 1.31       sys 0.15

Nice patch.  I see a similar performance improvement.

> I also wrote two additional patches.
>
> Second, If search multiple fixed words, grep immediately returns without
> longest match if not needed.  Without this change, grep tries longest
> match for multiple words even if not needed.

That sounds like a very good idea and the change looks correct.

> Third, use memchr2 for two patterns of a character.

This sounds useful, but I discovered that the new code is not
triggered even once by any of our tests. As such, I suspect that
it is not justified to add these new conditionals in code that is
often inlined.  Do you have experiments that demonstrate how
that final patch helps?

Thank you,
Jim





reply via email to

[Prev in Thread] Current Thread [Next in Thread]