bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#16966: [PATCH] grep: optimization with the superset of DFA


From: Norihiro Tanaka
Subject: bug#16966: [PATCH] grep: optimization with the superset of DFA
Date: Wed, 02 Apr 2014 22:02:01 +0900

Paolo Bonzini wrote:
> Does anything change if there are a few million c's?

The superset of `a ANYCHAR b' is 'a CSET STAR b'.
It's DFA states are following.

s0: The position set is none. 
s1: The position set is 1:a 
s2: The position set is 1:a 2:CSET
s3: The position set is 1:a 2:CSET 3:b (accepted)

        1  accccccccccccccccccccccccccccccccccccccc
           ^ s1

        1  accccccccccccccccccccccccccccccccccccccc
            ^ s2

        1  accccccccccccccccccccccccccccccccccccccc
             ^ s2

              ......

1,000,000  cccccccccccccccccccccccccccccccccccccccb
                                                 ^ s2

1,000,000  cccccccccccccccccccccccccccccccccccccccb
                                                  ^ s3 accepted

Then re-searched with the superset because matched on multi-lines.

1,000,000  cccccccccccccccccccccacccccccccccccccccb
           ^ s0


1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                ^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                                  ^ s3 accepted

Then re-searched with the original DFA becuase matched on one-line.
The DFA states are following.

s0: The position set is none. 
s1: The position set is 1:a 
s2: The position set is 1:a 2:ANYCHAR
s3: The position set is 1:a 2:ANYCHAR 3:b (accepted)

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                ^ s1

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                 ^ s2

1,000,000  cccccccccccccccccccccacccccccccccccccccb
                                  ^ s0 rejected



Even if matched multi-line, It's still as fast and memory-required as
matched single-line, because in search-mode STAR doesn't count number
of NFA and DFA states up.  So for example, a set of DFA states for
`a CSET STAR b' is correspond with for `a CSET b'. (If my recognition is
right.)

OTOH, If CSET{1,mb_cur_max} is assigned for ANYCHAR, number of DFA
states will be greater than above, and I may be slightly slower
becuase of overheads to build DFA states.  Notice that `{m,n}'
expression requires a lot of memory and is slow (See also
http://www.gnu.org/software/grep/manual/grep.html#Reporting-Bugs).

Example, use the superset for following pattern in EUC-JP locale.
It's mb_cur_max == 3.

    a.....b

If ANYCHAR is replaced to `CSET STAR', as following.

    a CSET STAR CSET STAR CSET STAR CSET STAR CSET STAR CSET STAR b

I draw it below.

        /\
       /  \  2:CSET
       \  /
        \/
       1:a ---> 3:b
(The figure drawn previously was wrong.)

It's equal to following.

    a CSET STAR b

OTOH, the period is replaced to `CSET CSET CSET CAT OR CSET CSET CAT
CSET CAT OR', below.  Itt's complicated, and I don't want to draw it. :)

    a CSET CSET CSET CAT OR CSET CSET CAT CSET CAT OR CSET CSET CSET
    CAT OR CSET CSET CAT CSET CAT OR CAT CSET CSET CSET CAT OR CSET
    CSET CAT CSET CAT OR CAT CSET CSET CSET CAT OR CSET CSET CAT CSET
    CAT OR CAT b

Thanks,
Norihiro






reply via email to

[Prev in Thread] Current Thread [Next in Thread]