
Re: [bug-gawk] Computed regex and getline bug / issue


From: Andrew J. Schorr
Subject: Re: [bug-gawk] Computed regex and getline bug / issue
Date: Tue, 6 May 2014 10:18:21 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

Hi,

On Mon, May 05, 2014 at 09:04:23AM +0300, Aharon Robbins wrote:
> It is a heuristic. Consider an RS like what we have: RS = ",+".  Here,
> we want as many commas as we can possibly slurp up.  Now consider a file
> like so, where the | indicates a file block boundary:
> 
>       .... ,,, | ,, ...
> 
> rsrescan has seen the first three commas, but it doesn't know if the
> next block starts with a comma, or with something else.  So it tells
> get_a_record, "read some more data and retry", in case there's more
> stuff that could be matched.
> 
> This was done to solve a real problem I encountered, where something
> like   foo(bar)*  was the RS and the "foo" fell exactly on the end of
> the block boundary; even though there was a "bar" at the beginning of
> the next block, gawk wasn't picking it up.

That makes some sense conceptually.  I think it would be wise to have
test cases for both of these problems.

I tried to make a test case for the problem you describe, and I am
not having any luck.  Can you see what I'm doing wrong?  The input
file blockboundary.in is attached.  

With gawk 4.1.1:

bash-4.2$ AWKBUFSIZE=7 /bin/gawk -v "RS=foo(bar)*" 1  < blockboundary.in
cats
dogs
mice
bats

bash-4.2$ AWKBUFSIZE=7 /bin/gawk -v "RS=foo(bar)*" '{print; rc = getline; print rc; print}' < blockboundary.in
cats
1
dogs
mice
0
mice

With my patch to prevent rsrescan from returning TERMNEAREND:

bash-4.2$ AWKBUFSIZE=7 ./gawk -v "RS=foo(bar)*" 1  < blockboundary.in
cats
dogs
mice
bats
bash-4.2$ AWKBUFSIZE=7 ./gawk -v "RS=foo(bar)*" '{print; rc = getline; print rc; print}' < blockboundary.in
cats
1
dogs
mice
1
bats

Using strace, I see that gawk seems to do a certain amount of readahead
in any case:

bash-4.2$ AWKBUFSIZE=7 strace -eread ./gawk -v "RS=foo(bar)*" 1  < blockboundary.in
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0PO\1\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\252\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\300\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\16\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260T\0\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\34\2\0\0\0\0\0"..., 832) = 832
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>address@hidden"..., 832) = 832
read(0, "catsfoo", 7)                   = 7
read(0, "dogsfoo", 7)                   = 7
cats
read(0, "micefoo", 7)                   = 7
dogs
read(0, "barbats", 7)                   = 7
mice
read(0, "", 7)                          = 0
bats
+++ exited with 0 +++
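
For anyone who wants to try this without the attachment: judging only
from the read(0, ...) calls in the trace above, the input appears to be
28 bytes with no trailing newline.  That is an inference from the trace,
not a copy of the attached file.  A minimal sketch that builds an
equivalent input:

```shell
# Assumption: contents reconstructed from the four 7-byte read(0, ...)
# calls in the strace output, followed by EOF (so no trailing newline).
printf 'catsfoodogsfoomicefoobarbats' > blockboundary.in

# Sanity check: the size should be 28 bytes, i.e. four blocks of 7.
wc -c < blockboundary.in
```

With AWKBUFSIZE=7, each "foo" then ends exactly on a block boundary, and
the single "bar" sits at the start of the fourth block.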

Do you have any thoughts on how to construct a test case that will show
the TERMNEAREND problem?

Regards,
Andy

Attachment: blockboundary.in
Description: Text document

