[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#16966: [PATCH] grep: optimization with the superset of DFA
From: |
Norihiro Tanaka |
Subject: |
bug#16966: [PATCH] grep: optimization with the superset of DFA |
Date: |
Fri, 04 Apr 2014 12:28:47 +0900 |
Norihiro Tanaka wrote:
> s0: The position set is none.
> s1: The position set is 1:a
> s2: The position set is 1:a 2:CSET
> s3: The position set is 2:CSET 3:b (accepted)
> s4: The position set is 2:CSET
Sorry, it was wrong yet. It should be as follows.
s0: The position set is none.
s1: The position set is 1:a 2:CSET 3:b
s2: The position set is 1:a 2:CSET 3:b (accepted)
First of all, I noticed that the example is wrong, because if fixed
strings are included in the pattern, kwset is used for the pattern,
and dfahint() is called for each single line.
--
By the way, I compared two cases, which are CSET* and
CSET{1,mb_cur_max}. I used `\(a\|z\)[x-y]\{10\}\(b\|z\)' as the pattern.
`[x-z]' is MBCSET in UTF-8 locale.
If replace MBCSET to CSET*, the pattern in the superset is
`\(a\|z\)\(CSET*\)\{10\}\(b\|z\)'.
If replace MBCSET to CSET{1,mb_cur_max}, the pattern in the superset is
`\(a\|z\)\(CSET\{1,6\}\)\{10\}\(b\|z\)'.
In order to simplify, use `\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' in
POSIX locale for later.
Before testing, apply the patch attached on this mail, Set `DDEBUG' to
CPPFLAGS, and re-build. It outputs more debugging information.
$ yes `printf '%0100d\n'` | head -320 | \
sed -e '1s/^0/a/; $s/0$/b/' | \
env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' 2>debug1.log
$ yes `printf '%0100d\n'` | head -320 | \
sed -e '1s/^0/a/; $s/0$/b/' | \
env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' 2>debug2.log
(`320' originates in the default size of a buffer.)
The formar is only built 3 states.
OTOH, the later is built 224 dfastates.
i.e. even if matched multi-lines, many states aren't necessarily built.
Next, I checked performance for each case in worst text.
It's best in 10 times.
yes `printf '%0100d\n'` | head -2 | sed -e '1s/^0/a/; $s/0$/b/' >t
yes t | head -1000000 | xargs cat >u
time -p env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' u
real 1.26 user 0.95 sys 0.28
time -p env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' u
real 0.74 user 0.45 sys 0.27
Though, the later is fast, I like the former, because it's very simple.
the later is complicated, also needs to update nleaves and depth.
Norihiro
patch.txt
Description: Text document
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Paolo Bonzini, 2014/04/01
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Norihiro Tanaka, 2014/04/01
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Paolo Bonzini, 2014/04/01
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Norihiro Tanaka, 2014/04/01
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Paolo Bonzini, 2014/04/01
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Norihiro Tanaka, 2014/04/01
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Norihiro Tanaka, 2014/04/01
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Paolo Bonzini, 2014/04/02
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Norihiro Tanaka, 2014/04/02
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Norihiro Tanaka, 2014/04/02
- bug#16966: [PATCH] grep: optimization with the superset of DFA,
Norihiro Tanaka <=
- bug#16966: [PATCH] grep: optimization with the superset of DFA, Norihiro Tanaka, 2014/04/01