bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: sync dfaexec from gawk: fixes real bugs in grep


From: Jim Meyering
Subject: Re: sync dfaexec from gawk: fixes real bugs in grep
Date: Fri, 12 Mar 2010 15:30:19 +0100

Paolo Bonzini wrote:
>> +  *end = eol;
>
> As a follow-up I'd make this *end++ = eol, so that the code below
> either simplifies (fewer "+ 1") or becomes more idiomatic
> (e.g. comparing "< end" and ">= end" instead of <= and >).

I have mixed feelings about that.
On one hand, it seems like it'd make the code more idiomatic,
as you say.  On the other, though, and perhaps more importantly,
it would obscure the fact that the appended byte is just a sentinel
and not part of the input buffer per se.

In addition, while resigned to it for now, I'm still against the fact
that this new API makes "*END" a non-const pointer, and would like to
reserve the right to fix that later.  Making the above change would
serve to entrench the new API.

>> -      MALLOC(mblen_buf, unsigned char, end - (unsigned char const *)begin + 
>> 2);
>> -      MALLOC(inputwcs, wchar_t, end - (unsigned char const *)begin + 2);
>> +      MALLOC(mblen_buf, unsigned char, end - begin + 2);
>> +      MALLOC(inputwcs, wchar_t, end - begin + 2);
>
> Should this become + 3?

No, + 2 is fine.
Here is the code that modifies those buffers:

      MALLOC(mblen_buf, unsigned char, end - (unsigned char const *)begin + 2);
      MALLOC(inputwcs, wchar_t, end - (unsigned char const *)begin + 2);
      memset(&mbs, 0, sizeof(mbstate_t));
      remain_bytes = 0;
      for (i = 0; i < end - begin + 1; i++)
        {
          ...
          inputwcs[i] = ...
          mblen_buf[i] = ...
          ...
        }
      inputwcs[i] = 0;
      mblen_buf[i] = 0;


>> +        if ((char *) p>  end)
>> +          break;
>
> So the sentinel doesn't work when doing MB matching?

Why do you ask that?

>> +      while ((t = trans[s]) != 0) { /* hand-optimized loop */
>
> Would you change this comment to "hand-unrolled loop"?

I presume that was the intent.  Already, I was tempted to remove it.
I would change several things (formatting first ;-)
about that code, eventually.  If I understood it better, I would
add a comment there, too.  For now, I'm still in the process of syncing
from gawk.

>> +      /* If the previous character was a newline, count it. */
>> +      if (count&&  (char *) p<= end&&  p[-1] == eol)
>> +    ++*count;

[BTW, your mail client seems to be mangling the spacing of quoted code.]

> Probably present in gawk too but, why p <= end?  That would be always
> true due to the sentinel; wouldn't the right test be p > begin?

Currently, dfaexec is not even called for an empty line.

When P == end, we want to know if P[-1] (the last byte)
is a newline.  Similarly for any smaller P.
This eliminates the P == end + 1 case, in which the sentinel
would give a false positive.
P actually does exceed "end" in some cases.

> Or, what happens if you grep in a file that starts with '\n'?

For now, grep never sets the COUNT pointer, so that is dead code.

>> * src/search.c: Adjust to new dfaexec API.
>> Now, dfaexec returns a pointer, not an integer,
>> and the third parameter is END, not buffer size.
>
> I suppose you'd squash 1/3 and 2/3 for bisectability?  Or could the
> search.c part be applied already to 1/3?

I'd rather not squash them.  1/3 is already so large...
This is probably a good case for a merge commit: i.e.,
push a branch with those two commits, and then merge that
onto master.  Then we have separate commits, yet the result
is still bisectable.

>> +      *end = saved_end;                             \
>
> Do we need to restore this or can we treat it as a scratch area?

In grep, we must restore it, since it is the first byte of
the following buffer.
In gawk, it is not currently required.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]