gawk patch for POSIX-conformance floating-point syntax issues


From: Paul Eggert
Subject: gawk patch for POSIX-conformance floating-point syntax issues
Date: Fri, 14 Jan 2005 23:39:58 -0800
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

While looking into adding support for floating-point hexadecimal
constants to gawk, I noticed that gawk has several POSIX-conformance
bugs with ordinary (non-hexadecimal) constants.

Like many languages, POSIX awk has a "maximal munch" policy for
tokenization: the lexer returns the longest token that is a prefix of
the remaining input.  A string like "01e+x" is therefore supposed to
be parsed as if it were "01 e + x".  However, gawk gets confused and
parses it as if it were "01 x".  In some other cases, e.g., 010e2,
gawk munches the whole token but then converts only part of it.

Here is some sample code that gawk mishandles.  Solaris 9 nawk does
the right thing (i.e., the POSIX-conforming thing) in all these cases.

        Assuming these assignments:
        e = "1(e)"
        ex = "3e2(ex)"
        x = "6e5(x)"

        command       correct output     gawk 3.1.4 output
        print 0x      06e5(x)            0                 
        print 0e+x    0600001            0
        print 0ex     03e2(ex)           0
        print 010e2   1000               8
        print 0e9.3   00.3               0

I realize that awk hexadecimal constants are still a bit
controversial; they are required by POSIX 1003.1-2001, but the
specification has bugs and the bugs are still being reviewed by the
POSIX committee.  However, the bugs in the non-hexadecimal constants
should not be controversial: gawk should follow the maximal-munch rule
as POSIX requires.

Here is a patch that addresses both issues.  It contains some code
that I wrote a while ago, but I thought I'd submit it now so that it
gets archived publicly.

If the POSIX committee decides that awk can or must support
hexadecimal constants, this patch can be dropped into gawk 3.1.4
as-is.  If they decide that awk cannot support hexadecimal constants,
about half of it needs to be added and the rest suppressed.  To avoid
wasted effort, I'd rather defer hacking on this further until the POSIX
folks have made up their minds about hexadecimal constants.

2005-01-14  Paul Eggert  <address@hidden>

        Add support for hexadecimal floating point numbers, and
        fix some bugs in parsing non-hexadecimal numbers.
        
        * awk.h (isnondecimal): Do not consider "08" or "09" to be
        nondecimal.  This is a bit faster here, and lets us tune nondec2awknum.
        * awkgram.y (yylex): Fix bugs when doing maximal munch on
        nontokens like "0e", "0e+", and "0x".  Add support for hexadecimal
        numbers, as POSIX requires.  Don't mishandle cases like "0e9.3".
        Avoid strlen in a couple of cases where the length can be computed
        via subtraction.  Terminate token with '\0', not with garbage
        that almost-always works.
        * builtin.c (nondec2awknum): Assume that input string satisfies
        isnondecimal(...); this simplifies the code.  Add full support
        for hexadecimal constants as per POSIX, if strtod supports them
        as POSIX requires.  Fall back on old gawk behavior (partial support
        for hexadecimal constants) if strtod does not support them.
        Parse constants like 010e0 and 010.0 as decimal, not octal.

        * doc/gawk.texi (Nondecimal-numbers): Hexadecimal code constants
        are no longer a gawk extension, since POSIX requires them now.
        Warn that octal constants are treated as decimal in compatibility mode.

        * test/Makefile.am (EXTRA_DIST): Add float.awk, float.ok.
        (BASIC_TESTS): Add float.
        * test/float.awk, test/float.ok: New files.

--- gawk-3.1.4/awk.h    2004-07-26 07:11:05 -0700
+++ gawk-3.1.4-hexfloat/awk.h   2005-01-14 16:04:12 -0800
@@ -742,7 +742,8 @@ extern char casetable[];    /* for case-ind
 /* ------------------------- Pseudo-functions ------------------------- */
 
 #define is_identchar(c)                (isalnum(c) || (c) == '_')
-#define isnondecimal(str)      (((str)[0]) == '0' && (ISDIGIT((str)[1]) \
+#define isnondecimal(str)      (((str)[0]) == '0' \
+                                && (('0' <= (str)[1] && (str)[1] <= '7') \
                                        || (str)[1] == 'x' || (str)[1] == 'X'))
 
 #define var_uninitialized(n)   ((n)->var_value == Nnull_string)
--- gawk-3.1.4/awkgram.y        2004-07-26 07:11:12 -0700
+++ gawk-3.1.4-hexfloat/awkgram.y       2005-01-14 21:42:13 -0800
@@ -1999,36 +1999,65 @@ retry:
                /* It's a number */
                for (;;) {
                        int gotnumber = FALSE;
+                       int c1;
+                       int c2;
 
                        tokadd(c);
                        switch (c) {
                        case 'x':
                        case 'X':
-                               if (do_traditional)
+                               if ((do_traditional && ! do_posix)
+                                   || tok != tokstart + 2
+                                   || tokstart[0] != '0')
+                                       goto done;
+                               c1 = nextc();
+                               if (c1 == '.') {
+                                       c1 = nextc();
+                                       pushback();
+                               }
+                               pushback();
+                               switch (c1) {
+                               case '0': case '1': case '2': case '3':
+                               case '4': case '5': case '6': case '7':
+                               case '8': case '9':
+                               case 'a': case 'A': case 'b': case 'B':
+                               case 'c': case 'C': case 'D': case 'd':
+                               case 'e': case 'E': case 'f': case 'F':
+                                       break;
+                               default:
                                        goto done;
-                               if (tok == tokstart + 2)
+                               }
                                        inhex = TRUE;
                                break;
                        case '.':
-                               if (seen_point) {
-                                       gotnumber = TRUE;
-                                       break;
-                               }
+                               if (seen_point || seen_e)
+                                       goto done;
                                seen_point = TRUE;
                                break;
+                       case 'p':
+                       case 'P':
+                               if (!inhex)
+                                       goto done;
+                               goto exponent;
                        case 'e':
                        case 'E':
                                if (inhex)
                                        break;
-                               if (seen_e) {
-                                       gotnumber = TRUE;
-                                       break;
+                       exponent:
+                               if (seen_e)
+                                       goto done;
+                               c1 = nextc();
+                               if (c1 == '-' || c1 == '+') {
+                                       c2 = nextc();
+                                       pushback();
+                               } else
+                                       c2 = c1;
+                               if (! ISDIGIT(c2)) {
+                                       pushback();
+                                       goto done;
                                }
+                               tokadd(c1);
                                seen_e = TRUE;
-                               if ((c = nextc()) == '-' || c == '+')
-                                       tokadd(c);
-                               else
-                                       pushback();
                                break;
                        case 'a':
                        case 'A':
@@ -2040,7 +2069,8 @@ retry:
                        case 'd':
                        case 'f':
                        case 'F':
-                               if (do_traditional || ! inhex)
+                               if ((do_traditional && ! do_posix)
+                                   || ! inhex || seen_e)
                                        goto done;
                                /* fall through */
                        case '0':
@@ -2068,15 +2098,15 @@ retry:
                        lintwarn(_("source file does not end in newline"));
                        eof_warned = TRUE;
                }
-               tokadd('\0');
+               *--tok = '\0';
                if (! do_traditional && isnondecimal(tokstart)) {
                        static short warned = FALSE;
-                       if (do_lint && ! warned) {
+                       if (ISDIGIT(tokstart[1]) && do_lint && ! warned) {
                                warned = TRUE;
-                               lintwarn("numeric constant `%.*s' treated as octal or hexadecimal",
-                                       strlen(tokstart)-1, tokstart);
+                               lintwarn("numeric constant `%s' treated as octal",
+                                       tokstart);
                        }
-                       yylval.nodeval = make_number(nondec2awknum(tokstart, strlen(tokstart)));
+                       yylval.nodeval = make_number(nondec2awknum(tokstart, tok - tokstart));
                } else
                        yylval.nodeval = make_number(atof(tokstart));
                yylval.nodeval->flags |= PERM;
--- gawk-3.1.4/builtin.c        2004-07-13 00:55:28 -0700
+++ gawk-3.1.4-hexfloat/builtin.c       2005-01-14 22:44:17 -0800
@@ -2799,12 +2799,23 @@ do_strtonum(NODE *tree)
 AWKNUM
 nondec2awknum(char *str, size_t len)
 {
-       AWKNUM retval = 0.0;
+       AWKNUM retval;
+       char *endp;
        char save;
        short val;
-       char *start = str;
 
-       if (*str == '0' && (str[1] == 'x' || str[1] == 'X')) {
+       save = str[len];
+       str[len] = '\0';
+       retval = strtod(str, &endp);
+       str[len] = save;
+
+       if (endp == str + 1) {
+               /*
+                * This must be a number that begins with 0x or 0X.
+                * On pre-C99 hosts, use a poor substitute for C99 strtod,
+                * which does not recognize fractions or exponents.
+                */
+
                /*
                 * User called strtonum("0x") or some such,
                 * so just quit early.
@@ -2847,22 +2858,32 @@ nondec2awknum(char *str, size_t len)
                        }
                        retval = (retval * 16) + val;
                }
-       } else if (*str == '0') {
-               for (; len > 0; len--) {
-                       if (! ISDIGIT(*str))
-                               goto done;
-                       else if (*str == '8' || *str == '9') {
-                               str = start;
-                               goto decimal;
-                       }
-                       retval = (retval * 8) + (*str - '0');
-                       str++;
-               }
        } else {
-decimal:
-               save = str[len];
-               retval = strtod(str, NULL);
-               str[len] = save;
+               /* This must be a number that begins with 00 through 07.  */
+               AWKNUM octalval = str[1] - '0';
+               size_t i;
+               for (i = 2; i < len; i++)
+                       switch (str[i]) {
+                       case '0': case '1': case '2': case '3':
+                       case '4': case '5': case '6': case '7':
+                               octalval = (octalval * 8) + (str[i] - '0');
+                               break;
+
+                       case '8': case '9': case '.':
+                               return retval;
+
+                       case 'e': case 'E':
+                               if (i + 1 < len
+                                   && ISDIGIT(str[i + 1
+                                                  + (i + 2 < len
+                                                     && (str[i + 1] == '-'
+                                                         || str[i + 1] == '+'))]))
+                                       return retval;
+                               return octalval;
+                       default:
+                               return octalval;
+                       }
+               return octalval;
        }
 done:
        return retval;
--- gawk-3.1.4/doc/gawk.texi    2004-06-21 07:09:14 -0700
+++ gawk-3.1.4-hexfloat/doc/gawk.texi   2005-01-14 14:43:15 -0800
@@ -7417,11 +7417,10 @@ $ gawk 'BEGIN @{ print "021 is", 021 ; p
 @end example
 
 @cindex compatibility mode (@command{gawk}), octal numbers
-@cindex compatibility mode (@command{gawk}), hexadecimal numbers
-Octal and hexadecimal source code constants are a @command{gawk} extension.
+Octal source code constants are a @command{gawk} extension.
 If @command{gawk} is in compatibility mode
 (@pxref{Options}),
-they are not available.
+all such constants are treated as decimal numbers, as required by POSIX.
 
 @c fakenode --- for prepinfo
 @subheading Advanced Notes: A Constant's Base Does Not Affect Its Value
--- gawk-3.1.4/test/Makefile.am 2004-07-28 06:48:08 -0700
+++ gawk-3.1.4-hexfloat/test/Makefile.am        2005-01-14 22:26:45 -0800
@@ -151,6 +151,8 @@ EXTRA_DIST = \
        fldchgnf.awk \
        fldchgnf.in \
        fldchgnf.ok \
+       float.awk \
+       float.ok \
        fmttest.awk \
        fmttest.ok \
        fnamedat.awk \
@@ -554,7 +556,8 @@ BASIC_TESTS = addcomma anchgsub argarray
        arynocls aryprm1 aryprm2 aryprm3 aryprm4 aryprm5 aryprm6 aryprm7 \
        aryprm8 arysubnm asgext awkpath back89 backgsub childin clobber \
        clsflnam compare compare2 concat1 concat2 concat3 convfmt datanonl defref \
-       delarprm delarpm2 delfunc dynlj eofsplit exitval1 fldchg fldchgnf fmttest fnamedat \
+       delarprm delarpm2 delfunc dynlj eofsplit exitval1 fldchg fldchgnf \
+       float fmttest fnamedat \
        fnarray fnarray2 fnarydel fnaryscl fnasgnm fnmisc fnparydl \
        fordel forsimp fsbs fsrs fstabplus funsemnl funsmnam funstack getline \
        getline2 getline3 getlnbuf getnr2tb getnr2tm gsubasgn gsubtest \
--- /dev/null   2003-03-18 13:55:57 -0800
+++ gawk-3.1.4-hexfloat/test/float.awk  2005-01-14 22:38:53 -0800
@@ -0,0 +1,27 @@
+BEGIN {
+       e = "1(e)"      ; E = "1(E)"
+       e0 = "2e1(e0)"  ; E0 = "2e1(E0)"
+       ex = "3e2(ex)"  ; EX = "3e2(EX)"
+       p = "4e3(p)"    ; P = "4e3(P)"
+       p0 = "5e4(p0)"  ; P0 = "5e4(P0)"
+       x = "6e5(x)"    ; X = "6e5(X)"
+       xx = "7e6(xx)"  ; XX = "7e6(XX)"
+       print 0x        , 0X
+       print 0xx       , 0XX
+       print 0e+x      , 0E+X
+       print 0e-x      , 0E-X
+       print 0ex       , 0EX
+       print 0e3       , 0E3
+       print 0e-3      , 0E-3
+       print 0e+3      , 0E+3
+       print 0x0e0     , 0X0E0
+       print 010e0     , 010E0
+       print 10p0      , 10P0
+       print 0x.0      , 0X.0
+       print 0x0p0     , 0X0P0
+       print 0x10.p    , 0X10.P
+       print 0x10.p0   , 0X10.P0
+       print 0x10.p-0  , 0X10.P-0
+       print 0x10.p+0  , 0X10.P+0
+       print 0e9.3
+}
--- /dev/null   2003-03-18 13:55:57 -0800
+++ gawk-3.1.4-hexfloat/test/float.ok   2005-01-14 22:39:10 -0800
@@ -0,0 +1,18 @@
+06e5(x) 06e5(X)
+07e6(xx) 07e6(XX)
+0600001 0600001
+0-599999 0-599999
+03e2(ex) 03e2(EX)
+0 0
+0 0
+0 0
+224 224
+10 10
+105e4(p0) 105e4(P0)
+0 0
+0 0
+164e3(p) 164e3(P)
+16 16
+16 16
+16 16
+00.3



