bug-gnu-utils

Re: gawk bug with RS="^..."


From: Aharon Robbins
Subject: Re: gawk bug with RS="^..."
Date: Mon, 03 Jan 2005 11:47:25 +0200

Greetings.  Re this:

> Date: Tue, 14 Dec 2004 14:48:58 +0100
> From: Stepan Kasal <address@hidden>
> Subject: gawk bug with RS="^..."
> To: address@hidden
>
> Hello,
>   I've noticed a problem with "^" in RS in gawk.  In most cases, it seems
> to match only the beginning of the file.  But in fact it matches the
> beginning of gawk's internal buffer.
>
> Observe the following example:
>
> $ gawk 'BEGIN{for(i=1;i<=100;i++) print "Axxxxxx"}' >file
> $ gawk 'BEGIN{RS="^A"} END{print NR}' file
> 2
> $ gawk 'BEGIN{RS="^Ax*\n"} END{print NR}' file
> 100
> $ head file | gawk 'BEGIN{RS="^Ax*\n"} END{print NR}'
> 10
> $
>
> I think this calls for some clarification/fix.  But I don't have any
> fixed opinion how the solution should look like.
>
> Have a nice day,
>         Stepan Kasal
>
> PS: See also the discussion of the issue in the comp.lang.awk newsgroup.

Andreas Schwab's pointer to the `not_bol' flag in the struct pattern_buffer
did the trick.  Below is a patch that fixes the problem. To demonstrate:

        $ ./gawk 'BEGIN{RS="^A"} END{print NR}' file
        2
        $ ./gawk 'BEGIN{RS="^Ax*\n"} END{print NR}' file
        2
        $ head file | ./gawk 'BEGIN{RS="^Ax*\n"} END{print NR}'
        2

I will be adding the programs and input file to the test suite as well.
Note that line numbers are relative to my development version, but this
patch should apply to gawk 3.1.4 without too much difficulty.
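
For readers unfamiliar with the flag, here is a minimal sketch of what
`not_bol' does in the GNU regex interface.  It is not gawk code:
`search_chunk' is a hypothetical helper, and the sketch assumes a
glibc-style <regex.h> that exposes `re_compile_pattern', `re_search' and
the `not_bol' field when _GNU_SOURCE is defined.  The sample strings
contain no newlines, to keep newline anchoring out of the picture.

```c
#define _GNU_SOURCE		/* expose the GNU regex interface in <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

/*
 * search_chunk --- hypothetical helper, for illustration only.
 *
 * Compile `pattern' and search `text' with re_search().  When `at_start'
 * is zero, set not_bol first, so that `^' is not allowed to match at the
 * start of `text'.  Returns the match offset, or -1 if nothing matched.
 */
static int
search_chunk(const char *pattern, const char *text, int at_start)
{
	struct re_pattern_buffer pat;
	const char *err;
	int len = (int) strlen(text);
	int res;

	memset(&pat, 0, sizeof(pat));	/* library allocates the buffer */
	err = re_compile_pattern(pattern, strlen(pattern), &pat);
	assert(err == NULL);		/* NULL means the pattern compiled */

	pat.not_bol = at_start ? 0 : 1;
	res = (int) re_search(&pat, text, len, 0, len, NULL);

	regfree(&pat);
	return res;
}
```

Under these assumptions, search_chunk("^A", "AxxAxx", 1) finds the match
at offset 0, while search_chunk("^A", "AxxAxx", 0) returns -1.  The second
case is what is wanted for every buffer after the first one.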

Happy 2005 to everyone!

Arnold
---------------------------------------------------------------------
Mon Jan  3 11:23:36 2005   Arnold D. Robbins    <address@hidden>

        Fix obscure issue. ^ in RS should only match at the very
        beginning of the input.  Essentially, the file is one long
        string.  To do this, use the `not_bol' flag in the `struct
        pattern_buffer'.  Thanks to Stepan Kasal for pointing out the
        problem and to Andreas Schwab for pointing out the mechanism
        for a solution.

        * awk.h (RE_NEED_START, RE_NO_BOL): New flags for `research'.
        (IOP_AT_START): New flag for IOBUF.
        (research): Last parameter is now `flags'.
        * builtin.c (do_match, sub_common): Change calls to `research'.
        * eval.c (interpret, match_op): Same.
        * field.c (re_parse_field): Same.
        * io.c (spec_setup): Add IOP_AT_START flag.
        (iop_alloc): Same.
        (rsrescan): Check IOP_AT_START; if it is not set, add RE_NO_BOL
        to the flags value in the call to `research'.
        (get_a_record): Clear IOP_AT_START upon return from `*matchrec'.
        (iopflags2str): Add IOP_AT_START to table.  Also IOP_CLOSED,
        which was missing. (Ooops.)
        * re.c (research): Last parameter is now flags.  Modify logic to
        handle RE_NO_BOL case by setting the `not_bol' bit initially. Clean
        up control flow so that it's cleared before returning.  If RE_NO_BOL,
        don't bother with the dfa matcher, as it doesn't have an analogous
        capability.
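
The flag handling described in the ChangeLog can be sketched in isolation
as follows.  The three constants are copied from the patch;
`rs_search_flags' and `may_use_dfa' are hypothetical names used only for
illustration, standing in for the logic in `rsrescan' and `research'.

```c
#include <assert.h>

/* Values copied from the patch. */
#define RE_NEED_START	1	/* need to know start/end of match */
#define RE_NO_BOL	2	/* for RS, not allowed to match ^ in regexp */
#define IOP_AT_START	64	/* IOBUF flag: nothing has been read yet */

/*
 * rs_search_flags --- hypothetical stand-in for the rsrescan logic.
 *
 * Record splitting always needs the match position.  Once the IOBUF is
 * past the start of the input, `^' must no longer be allowed to match.
 */
static int
rs_search_flags(int iop_flag)
{
	int flags = RE_NEED_START;

	/* the two research flags must be distinct bits */
	assert((RE_NEED_START & RE_NO_BOL) == 0);

	if ((iop_flag & IOP_AT_START) == 0)
		flags |= RE_NO_BOL;
	return flags;
}

/*
 * may_use_dfa --- hypothetical stand-in for the test in research.
 *
 * The dfa matcher has no equivalent of not_bol, so it is skipped
 * whenever RE_NO_BOL is requested.
 */
static int
may_use_dfa(int have_dfa, int flags)
{
	return have_dfa && (flags & RE_NO_BOL) == 0;
}
```

In this sketch only the first search on an IOBUF, made while IOP_AT_START
is still set, is passed a flags value without RE_NO_BOL.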

diff -ur gawk-3.1.5/awk.h gawk-x/awk.h
--- gawk-3.1.5/awk.h    2004-10-03 22:59:47.000000000 +0200
+++ gawk-x/awk.h        2005-01-03 11:11:03.989913920 +0200
@@ -230,6 +230,9 @@
 #define        SUBPATEND(rp,s,n)       (rp)->regs.end[n]
 #define        NUMSUBPATS(rp,s)        (rp)->regs.num_regs
 #endif /* GNU_REGEX */
+/* regexp matching flags: */
+#define RE_NEED_START  1       /* need to know start/end of match */
+#define RE_NO_BOL      2       /* for RS, not allowed to match ^ in regexp */
 
 /* Stuff for losing systems. */
 #ifdef STRTOD_NOT_C89
@@ -595,6 +598,7 @@
 #              define  IOP_NOFREE_OBJ  8
 #               define  IOP_AT_EOF      16
 #               define  IOP_CLOSED      32
+#               define  IOP_AT_START    64
 } IOBUF;
 
 typedef void (*Func_ptr) P((void));
@@ -1129,7 +1133,7 @@
 /* re.c */
 extern Regexp *make_regexp P((const char *s, size_t len, int ignorecase, int dfa));
 extern int research P((Regexp *rp, char *str, int start,
-                      size_t len, int need_start));
+                      size_t len, int flags));
 extern void refree P((Regexp *rp));
 extern void reg_error P((const char *s));
 extern Regexp *re_update P((NODE *t));
diff -ur gawk-3.1.5/builtin.c gawk-x/builtin.c
--- gawk-3.1.5/builtin.c        2004-12-19 17:20:19.000000000 +0200
+++ gawk-x/builtin.c    2005-01-03 11:06:55.930083977 +0200
@@ -1937,7 +1937,7 @@
                assoc_clear(dest);
        }
        
-       rstart = research(rp, t1->stptr, 0, t1->stlen, TRUE);
+       rstart = research(rp, t1->stptr, 0, t1->stlen, RE_NEED_START);
        if (rstart >= 0) {      /* match succeded */
                rstart++;       /* 1-based indexing */
                rlength = REEND(rp, t1->stptr) - RESTART(rp, t1->stptr);
@@ -2141,7 +2141,7 @@
        t = force_string(tree_eval(tmp));
 
        /* do the search early to avoid work on non-match */
-       if (research(rp, t->stptr, 0, t->stlen, TRUE) == -1 ||
+       if (research(rp, t->stptr, 0, t->stlen, RE_NEED_START) == -1 ||
            RESTART(rp, t->stptr) > t->stlen) {
                free_temp(t);
                free_temp(s);
@@ -2347,7 +2347,7 @@
 
                if ((current >= how_many && !global)
                    || ((long) textlen <= 0 && matchstart == matchend)
-                   || research(rp, t->stptr, text - t->stptr, textlen, TRUE) == -1)
+                   || research(rp, t->stptr, text - t->stptr, textlen, RE_NEED_START) == -1)
                        break;
 
        }
diff -ur gawk-3.1.5/eval.c gawk-x/eval.c
--- gawk-3.1.5/eval.c   2004-11-03 14:55:36.000000000 +0200
+++ gawk-x/eval.c       2005-01-03 11:07:55.423005457 +0200
@@ -517,13 +517,13 @@
                                        NODE *t1;
                                        Regexp *rp;
                                        /* see comments in match_op() code about this. */
-                                       int kludge_need_start = FALSE;
+                                       int kludge_need_start = 0;
 
                                        t1 = force_string(switch_value);
                                        rp = re_update(case_stmt->lnode);
 
                                        if (avoid_dfa(tree, t1->stptr, t1->stlen))
-                                               kludge_need_start = TRUE;
+                                               kludge_need_start = RE_NEED_START;
                                        match_found = (research(rp, t1->stptr, 0, t1->stlen, kludge_need_start) >= 0);
                                        if (t1 != switch_value)
                                                free_temp(t1);
@@ -2059,7 +2059,7 @@
        register Regexp *rp;
        int i;
        int match = TRUE;
-       int kludge_need_start = FALSE;  /* FIXME: --- see below */
+       int kludge_need_start = 0;      /* FIXME: --- see below */
 
        if (tree->type == Node_nomatch)
                match = FALSE;
@@ -2074,7 +2074,7 @@
         * FIXME:
         *
         * Any place where research() is called with a last parameter of
-        * FALSE, we need to use the avoid_dfa test. This appears here and
+        * zero, we need to use the avoid_dfa test. This appears here and
         * in the code for Node_K_switch.
         *
         * A new or improved dfa that distinguishes beginning/end of
@@ -2084,7 +2084,7 @@
         * The avoid_dfa() function is in re.c; it is not very smart.
         */
        if (avoid_dfa(tree, t1->stptr, t1->stlen))
-               kludge_need_start = TRUE;
+               kludge_need_start = RE_NEED_START;
        i = research(rp, t1->stptr, 0, t1->stlen, kludge_need_start);
        i = (i == -1) ^ (match == TRUE);
        free_temp(t1);
diff -ur gawk-3.1.5/field.c gawk-x/field.c
--- gawk-3.1.5/field.c  2004-07-28 16:41:17.000000000 +0300
+++ gawk-x/field.c      2005-01-03 11:08:04.335346138 +0200
@@ -372,7 +372,7 @@
                        scan++;
        field = scan;
        while (scan < end
-              && research(rp, scan, 0, (end - scan), TRUE) != -1
+              && research(rp, scan, 0, (end - scan), RE_NEED_START) != -1
               && nf < up_to) {
                if (REEND(rp, scan) == RESTART(rp, scan)) {   /* null match */
 #ifdef MBS_SUPPORT
diff -ur gawk-3.1.5/io.c gawk-x/io.c
--- gawk-3.1.5/io.c     2004-12-20 12:07:52.000000000 +0200
+++ gawk-x/io.c 2005-01-03 11:33:16.067692405 +0200
@@ -1433,7 +1433,7 @@
        iop->end = iop->buf + len;
        iop->dataend = iop->end;
        iop->fd = -1;
-       iop->flag = IOP_IS_INTERNAL;
+       iop->flag = IOP_IS_INTERNAL | IOP_AT_START;
 }
 
 /* specfdopen --- open an fd special file */
@@ -2466,7 +2466,7 @@
         iop->off = iop->buf;
         iop->dataend = NULL;
         iop->end = iop->buf + iop->size;
-       iop->flag = 0;
+       iop->flag |= IOP_AT_START;
         return iop;
 }
 
@@ -2661,6 +2661,7 @@
         register char *bp;
         size_t restart = 0, reend = 0;
         Regexp *RSre = RS_regexp;
+       int regex_flags = RE_NEED_START;
 
         memset(recm, '\0', sizeof(struct recmatch));
         recm->start = iop->off;
@@ -2669,9 +2670,11 @@
         if (*state == INDATA)
                 bp += iop->scanoff;
 
+       if ((iop->flag & IOP_AT_START) == 0)
+               regex_flags |= RE_NO_BOL;
 again:
         /* case 1, no match */
-        if (research(RSre, bp, 0, iop->dataend - bp, TRUE) == -1) {
+        if (research(RSre, bp, 0, iop->dataend - bp, regex_flags) == -1) {
                 /* set len, in case this all there is. */
                 recm->len = iop->dataend - iop->off;
                 return NOTERM;
@@ -2883,6 +2886,8 @@
 
                 ret = (*matchrec)(iop, & recm, & state);
 
+               iop->flag &= ~IOP_AT_START;
+
                 if (ret == REC_OK)
                         break;
 
@@ -3118,6 +3123,8 @@
                { IOP_NO_FREE, "IOP_NO_FREE" },
                { IOP_NOFREE_OBJ, "IOP_NOFREE_OBJ" },
                { IOP_AT_EOF,  "IOP_AT_EOF" },
+               { IOP_CLOSED, "IOP_CLOSED" },
+               { IOP_AT_START,  "IOP_AT_START" },
                { 0, NULL }
        };
 
diff -ur gawk-3.1.5/re.c gawk-x/re.c
--- gawk-3.1.5/re.c     2004-11-23 17:27:19.000000000 +0200
+++ gawk-x/re.c 2005-01-03 11:06:07.483108155 +0200
@@ -205,16 +205,28 @@
 
 int
 research(Regexp *rp, register char *str, int start,
-       register size_t len, int need_start)
+       register size_t len, int flags)
 {
        const char *ret = str;
        int try_backref;
+       int need_start;
+       int no_bol;
+       int res;
+
+       need_start = ((flags & RE_NEED_START) != 0);
+       no_bol = ((flags & RE_NO_BOL) != 0);
+
+       if (no_bol)
+               rp->pat.not_bol = 1;
 
        /*
         * Always do dfa search if can; if it fails, then even if
         * need_start is true, we won't bother with the regex search.
+        *
+        * The dfa matcher doesn't have a no_bol flag, so don't bother
+        * trying it in that case.
         */
-       if (rp->dfa) {
+       if (rp->dfa && ! no_bol) {
                char save;
                int count = 0;
                /*
@@ -233,14 +245,15 @@
                         * Passing NULL as last arg speeds up search for cases
                         * where we don't need the start/end info.
                         */
-                       int res = re_search(&(rp->pat), str, start+len,
+                       res = re_search(&(rp->pat), str, start+len,
                                start, len, need_start ? &(rp->regs) : NULL);
-
-                       return res;
                } else
-                       return 1;
+                       res = 1;
        } else
-               return -1;
+               res = -1;
+
+       rp->pat.not_bol = 0;
+       return res;
 }
 
 /* refree --- free up the dynamic memory used by a compiled regexp */



