bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#16966: [PATCH] grep: optimization with the superset of DFA


From: Norihiro Tanaka
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Fri, 04 Apr 2014 12:28:47 +0900

Norihiro Tanaka wrote:
> s0: The position set is none.
> s1: The position set is 1:a
> s2: The position set is 1:a 2:CSET
> s3: The position set is 2:CSET 3:b (accepted)
> s4: The position set is 2:CSET

Sorry, it was wrong yet.  It should be as follows.

s0: The position set is none.
s1: The position set is 1:a 2:CSET 3:b
s2: The position set is 1:a 2:CSET 3:b (accepted)

First of all, I noticed that the example is wrong, because if fixed
strings are included in the pattern, kwset is used for the pattern,
and dfahint() is called for each single line.

--
By the way, I compared two cases, which are CSET* and
CSET{1,mb_cur_max}.  I used `\(a\|z\)[x-y]\{10\}\(b\|z\)' as the pattern.
`[x-z]' is MBCSET in UTF-8 locale.

If replace MBCSET to CSET*, the pattern in the superset is
`\(a\|z\)\(CSET*\)\{10\}\(b\|z\)'.
If replace MBCSET to CSET{1,mb_cur_max}, the pattern in the superset is
`\(a\|z\)\(CSET\{1,6\}\)\{10\}\(b\|z\)'.

In order to simplify, use `\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' in
POSIX locale for later.

Before testing, apply the patch attached on this mail, Set `DDEBUG' to
CPPFLAGS, and re-build.  It outputs more debugging information.

  $ yes `printf '%0100d\n'` | head -320 | \
      sed -e '1s/^0/a/; $s/0$/b/' | \
      env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' 2>debug1.log
  $ yes `printf '%0100d\n'` | head -320 | \
      sed -e '1s/^0/a/; $s/0$/b/' | \
      env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' 2>debug2.log

(`320' originates in the default size of a buffer.)

The formar is only built 3 states.
OTOH, the later is built 224 dfastates.
i.e. even if matched multi-lines, many states aren't necessarily built. 

Next, I checked performance for each case in worst text.
It's best in 10 times.

yes `printf '%0100d\n'` | head -2 | sed -e '1s/^0/a/; $s/0$/b/' >t
yes t | head -1000000 | xargs cat >u

time -p env LANG=C src/grep '\(a\|z\)\(.\{1,6\}\)\{10\}\(b\|z\)' u
    real 1.26       user 0.95       sys 0.28
time -p env LANG=en_US.UTF-8 src/grep '\(a\|z\)[x-y]\{10\}\(b\|z\)' u
    real 0.74       user 0.45       sys 0.27

Though, the later is fast, I like the former, because it's very simple.
the later is complicated, also needs to update nleaves and depth.

Norihiro

Attachment: patch.txt
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]