bug-grep
Re: dfa.h / dfa.c diff versus gawk attached


From: Tony Abou-Assaleh
Subject: Re: dfa.h / dfa.c diff versus gawk attached
Date: Sun, 21 Oct 2007 00:51:00 -0300 (ADT)

Hi Arnold,

Thanks for the clarifications. More questions are below :O)

> Most of the changes in that patch are years old and have been in gawk's
> code for a long time.  The only changes that were put into CVS are
> minor cosmetic ones noted here:

Ah, I see now.

> I had submitted largely the same patch years ago; note the entry in
> the TODO file in the grep source code. :-)

Yep. I've seen the TODO. There are a few things in that file that I still
need to get my head around.

> I am not familiar enough with the grep code to really answer this. Here
> is the history.  GNU grep has both a fast DFA matcher and a slower regex
> matcher.  The DFA matcher cannot match some things (like "\(foo\)bar\1"),
> so it needs both.  Gawk for many years has used the DFA matcher for
> "does it match" kinds of things, falling back to regex for "where does
> it match" operations.

Does this mean that grep uses dfa if it can?

This question has been raised a few times. When does grep use dfa and when
regex? Can anyone on the list shed some light?
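
To make the distinction in the quoted paragraph concrete, here is a small
sketch. It uses the POSIX regcomp()/regexec() interface as a stand-in and is
NOT grep's dfa.c API (I am not assuming anything about that interface here).
The helper names matches() and match_start() are made up for illustration.
The first answers "does it match" and handles the back-reference pattern
from the quote; the second answers "where does it match".

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* "Does it match?": returns 1 on match, 0 on no match, -1 if the
   pattern does not compile.  REG_NOSUB tells the matcher that no
   offsets are wanted.  The pattern is a POSIX basic regex.  */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* "Where does it match?": returns the start offset of the match, or -1
   if there is none (or the pattern does not compile).  On success the
   end offset is stored in *end when end is non-NULL.  */
long match_start(const char *pattern, const char *text, long *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;
    if (end != NULL)
        *end = (long) m[0].rm_eo;
    return (long) m[0].rm_so;
}
```

With these helpers, matches("\\(foo\\)bar\\1", "foobarfoo") succeeds while
the same pattern against "foobarbaz" does not, because \1 must repeat
whatever the group captured. That dependence on earlier input is what a pure
DFA cannot express.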

> Up to but not including grep 2.5, the DFA matcher was able to match across
> newline boundaries, if one handed it a string with embedded newlines.

Is there a way to pass grep a pattern with embedded newlines from the
command line? Or do I really have to write some code that uses the dfa
directly to test this?
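
If it does come down to writing code, I imagine a test would look roughly
like the sketch below. It again uses POSIX regcomp()/regexec() rather than
dfaexec(), so it only illustrates the behaviour in question (matching
across a newline inside one buffer versus treating each line separately),
not the dfa.c code path itself. match_buffer() and its line_mode flag are
hypothetical names.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Match a basic regex against a buffer that may contain newlines.
   line_mode == 0: the newline is an ordinary character, so "." can
                   match it and a match may span lines.
   line_mode != 0: REG_NEWLINE is set, so "." does not match a newline
                   and ^ / $ also anchor at line boundaries.
   Returns 1 on match, 0 on no match, -1 if the pattern does not
   compile.  */
int match_buffer(const char *pattern, const char *buf, int line_mode)
{
    regex_t re;
    int cflags = REG_NOSUB | (line_mode ? REG_NEWLINE : 0);
    int rc;

    assert(pattern != NULL && buf != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, buf, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For the buffer "a\nb", the pattern "a.b" matches with line_mode 0 and fails
with line_mode 1, which is the kind of before/after check I would want for
the dfa change as well.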

> This
> is critical for gawk.  At 2.5, someone "simplified" things by removing
> this capability, since grep normally matches only within single lines.
> I very carefully restored the code from the 2.4.x version to do this so
> that I could continue using the DFA matcher.  However, I don't really
> understand the DFA matcher; I worked mostly by careful pattern matching
> of old vs. new code.
>
> The mainline code that invokes dfaexec will have to change. You can see
> in grep 2.4.x how it used to be done.

Ouch. In my opinion, this makes it all the more crucial that we have some
test cases in place. I don't feel comfortable committing code that I don't
understand and can't verify. However, I am willing to work on rectifying
the situation.

> No discussion, sorry, it just had to be done.  The multibyte character
> patches can probably be found individually in the bug-gnu-utils list if
> you search back far enough.

Ouch again. I probably wouldn't be able to recognize it even if I found
it.

Some ideas on how we could proceed (in no particular order):

1) Assume that if it works for gawk then it's good enough for grep. Apply
the patch and make the necessary changes to make it work in grep.

2) You break the patch into multiple functional units. I (and others
on the list) try to verify the sub-patches by inspection, and maybe create
some test cases if things start to make sense. We commit sub-patches that
have been reviewed and accepted.

I think 2) is the "right" way of doing it; to some degree it follows how
grep has been managed over the past 2 years. But it is also a lot more
work.

I am willing to go with 1) if I hear consensus from 2-3 grep developers
that this is acceptable.

dfa->broken is used only under #ifdef GAWK. I am wondering what
implications this may have for grep, i.e., is there a case where grep may
also want to consider this "broken" case? There are no comments in dfa.c
documenting what is wrong with the DFA that triggers dfa->broken to be set.

We should probably put a comment in dfa.[ch] that all patches should be
sent to grep to keep things in sync. A similar note would go in regex.[ch]
to send patches to gnulib/libc.

Cheers,

TAA

-----------------------------------------------------
Tony Abou-Assaleh
Email:    address@hidden
Web site: http://tony.abou-assaleh.net
----------------------[THE END]----------------------