[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#19358: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multip
From: |
Norihiro Tanaka |
Subject: |
bug#19358: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words |
Date: |
Sat, 13 Dec 2014 00:58:02 +0900 |
Hi,
Searching multiple fixed words, grep uses Commentz-Walter algorithm, but
it is very slow for a worst case e.g. as following. It is O(m*n).
- input: yes `printf %040d` | head -10000000
- word1: x0000000000000000000
- word2: x
This change uses Aho-Corasick algorithm instead of Commentz-Walter
algorithm to search multiple fixed words. It uses a function to build
tries which has been already defined for Commentz-Walter algorithm in
kwset.c and which has been already high quality.
I see 7x speed-up even for a typical case on Fedora 21 with a 3.2GHz i5
by this change.
First without this change (best-of-5 trials):
find /usr/share/doc/ -type f |
LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words
>/dev/null
real 11.37 user 11.03 sys 0.24
Next with this change (best-of-5 trials):
find /usr/share/doc/ -type f |
LC_ALL=C time -p xargs.sh src/grep -Ff /usr/share/dict/linux.words
>/dev/null
real 1.49 user 1.31 sys 0.15
I also wrote two additional patches.
Second, If search multiple fixed words, grep immediately returns without
longest match if not needed. Without this change, grep tries longest
match for multiple words even if not needed.
Third, use memchr2 for two patterns of a character.
Thanks,
Norihiro
0001-grep-use-Aho-Corasick-algorithm-to-search-multiple-f.patch
Description: Text document
0002-grep-immediately-return-without-longest-match-to-sea.patch
Description: Text document
0003-use-memchr2-for-two-patterns-of-a-character.patch
Description: Text document
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- bug#19358: [PATCH 1/3] grep: use Aho-Corasick algorithm to search multiple fixed words,
Norihiro Tanaka <=