bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug-grep] Bugs in the DFA of grep


From: Olaf Seibert
Subject: [bug-grep] Bugs in the DFA of grep
Date: Wed, 6 Apr 2005 10:56:04 +0200
User-agent: Mutt/1.3.25i

The DFA as implemented in dfa.c from GNU grep-2.5.1a does not quite what
is documented in the source code.

I have been looking for a paper describing the method of construction of
this DFA, so I can compare it to the implementation, but haven't found
any. Can you point me in the right direction?

Consider this:

  int searchflag;               /* True if we are supposed to build a searching
                                   as opposed to an exact matcher.  A searching
                                   matcher finds the first and shortest string
                                   matching a regexp anywhere in the buffer,
                                   whereas an exact matcher finds the longest
                                   string matching, but anchored to the
                                   beginning of the buffer. */

Unfortunately, I discovered through a little instrumentation that an
"exact" matcher still does the same thing as a "searching" one.

This is the main loop in dfaexec():

/* Search through a buffer looking for a match to the given struct dfa.
   Find the first occurrence of a string matching the regexp in the buffer,
   and the shortest possible version thereof.  Return the offset of the first
   character after the match, or (size_t) -1 if none is found.  BEGIN points to
   the beginning of the buffer, and SIZE is the size of the buffer.  If SIZE
   is nonzero, BEGIN[SIZE - 1] must be a newline.  BACKREF points to a place
   where we're supposed to store a 1 if backreferencing happened and the
   match needs to be verified by a backtracking matcher.  Otherwise
   we store a 0 in *backref. */

  for (;;)
    {
        while ((t = trans[s])) {
          s = t[*p++];
#if DEBUG >= 2
          fprintf(stderr, "s: '%c' -> %d\n", p[-1], s);
#endif
        }

      if (s < 0)
        {
          if (p == end)
            {
#if DEBUG >= 2
            fprintf(stderr, "fail\n");
#endif
              return (size_t) -1;
            }
          s = 0;
        }
      else if ((t = d->fails[s]))
        {
          if (d->success[s] & sbit[*p])
            {
              if (backref)
                *backref = (d->states[s].backref != 0);
                return (char const *) p - begin;
            }

          s = t[*p++];
#if DEBUG >= 2
          fprintf(stderr, "s:.'%c' -> %d\n", p[-1], s);
#endif
        }
      else
        {
          build_state(s, d);
          trans = d->trans;
        }
    }

The upper loop,  while ((t = trans[s])), is, as it turns out, exited
when s == -1. However, in that case there is another test, if (p ==
end), that determines if the DFA will really stop.

In the first place, it is uncommon to require that BEGIN[SIZE - 1] must
be a newline: BEGIN[SIZE] would be natural. Moreover, the clause "If
SIZE is nonzero" does not seem true, since SIZE is used regardless.  And
if the terminator character is not in the "right" place, it will happily
continue all over memory.

In the second place, this also means the DFA will never fail until it
has examined the complete input. If it continues it gets reset to the
initial state. Therefore it is ALWAYS a searching matcher.

The fix is simple: remove the test, always return when s < 0.
If 'searchflag' is set, the DFA is constructed such that it never goes
to state -1 anyway (until the end).

In the third place, the code does not return the longest match, but
always the shortest, also contrary to the description.

Fortunately this is also fairly easy to fix, since the accepting states
do not necessarily terminate but may continue. So the DFA can remember a
match at that location, and when it finally fails it returns the latest
(and therefore longest) match it has seen.

Oh, and the indentation/brace location is evil.

So the end result is:

    int bestmatch = -1;

    for (;;) {
        while ((t = trans[s])) {
            s = t[*p++];
        }

        if (s < 0) {
            /*
            * if (p == end) removed by Olaf S: otherwise it would always
            * be a searching matcher.
            */
            return bestmatch;
        } else if ((t = d->fails[s])) {
            if (d->success[s] & sbit[*p]) {
                if (backref)
                    *backref = (d->states[s].backref != 0);
                bestmatch = (char const *) p - begin;
                if (!d->longest_match)  /* condition added by Olaf S */
                    return bestmatch;
            }

            s = t[*p++];
        } else {
            build_state (s, d);
            trans = d->trans;
        }
    }


-Olaf.
-- 



reply via email to

[Prev in Thread] Current Thread [Next in Thread]