gawk patch for POSIX-conformance floating-point syntax issues
From: Paul Eggert
Subject: gawk patch for POSIX-conformance floating-point syntax issues
Date: Fri, 14 Jan 2005 23:39:58 -0800
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

While looking into adding support for floating-point hexadecimal
constants to gawk, I noticed that gawk has several POSIX-conformance
bugs with ordinary (non-hexadecimal) constants.
Like many languages, POSIX awk has a "maximal munch" policy for
tokenization: when parsing a string like "01e+x", the longest token
that can be an initial prefix of the string is returned, so "01e+x" is
supposed to be parsed as if it were "01 e + x". However, gawk gets
confused and parses it as if it were "01 x". In some other cases,
e.g., 010e2, gawk munches the whole token but then misparses only part
of it.
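
As an illustration of the maximal-munch rule for ordinary decimal
constants, here is a minimal sketch in C. It is not the gawk lexer;
longest_decimal_prefix is a hypothetical helper that only reports how
many leading characters form a decimal constant (digits, optional
fraction, optional exponent). An exponent marker is consumed only when
at least one digit follows it, which is why "01e+x" yields the token
"01" and leaves "e + x" for later tokens.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Return the length of the longest prefix of s that is a decimal
 * floating-point constant, or 0 if s does not start with one.
 * Hypothetical helper for illustration only; not part of gawk.
 */
size_t
longest_decimal_prefix(const char *s)
{
	size_t i = 0;		/* length of the token accepted so far */
	size_t ndigits = 0;	/* digits seen in the integer and fraction parts */

	assert(s != NULL);

	/* Integer part. */
	while (isdigit((unsigned char) s[i])) {
		i++;
		ndigits++;
	}

	/* Optional fraction; a lone "." is not a number. */
	if (s[i] == '.') {
		size_t j = i + 1;
		size_t nfrac = 0;

		while (isdigit((unsigned char) s[j])) {
			j++;
			nfrac++;
		}
		if (ndigits + nfrac > 0) {
			i = j;
			ndigits += nfrac;
		}
	}
	if (ndigits == 0)
		return 0;

	/* Optional exponent: accept it only if a digit follows the sign. */
	if (s[i] == 'e' || s[i] == 'E') {
		size_t j = i + 1;

		if (s[j] == '+' || s[j] == '-')
			j++;
		if (isdigit((unsigned char) s[j])) {
			while (isdigit((unsigned char) s[j]))
				j++;
			i = j;
		}
	}
	return i;
}
```

For example, longest_decimal_prefix("010e2") covers the whole string,
while longest_decimal_prefix("0e9.3") stops before the ".", so ".3"
becomes a separate token.
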

Here is some sample code that gawk mishandled. Solaris 9 nawk does
the right thing (i.e., the POSIX-conforming thing) in all these cases.

Assuming these assignments:

  e = "1(e)"
  ex = "3e2(ex)"
  x = "6e5(x)"

  command        correct output   gawk 3.1.4 output
  print 0x       06e5(x)          0
  print 0e+x     0600001          0
  print 0ex      03e2(ex)         0
  print 010e2    1000             8
  print 0e9.3    00.3             0

I realize that awk hexadecimal constants are still a bit
controversial; they are required by POSIX 1003.1-2001, but the
specification has bugs and the bugs are still being reviewed by the
POSIX committee. However, the bugs in the non-hexadecimal constants
should not be controversial: gawk should follow the maximal-munch rule
as POSIX requires.

Here is a patch that addresses both issues. It contains some code
that I wrote a while ago, but I'm submitting it now so that it is
archived publicly.

If the POSIX committee decides that awk can or must support
hexadecimal constants, this patch can be dropped into gawk 3.1.4
as-is. If they decide that awk cannot support hexadecimal constants,
about half of it needs to be added and the rest suppressed. To avoid
wasted effort, I'd rather defer hacking on this further until the POSIX
folks have made up their minds about hexadecimal constants.

2005-01-14 Paul Eggert <address@hidden>
Add support for hexadecimal floating point numbers, and
fix some bugs in parsing non-hexadecimal numbers.
* awk.h (isnondecimal): Do not consider "08" or "09" to be nondecimal.
This is a bit faster here, and lets us tune nondec2awknum.
* awkgram.y (yylex): Fix bugs when doing maximal munch on
nontokens like "0e", "0e+", and "0x". Add support for hexadecimal
numbers, as POSIX requires. Don't mishandle cases like "0e9.3".
Avoid strlen in a couple of cases where the length can be computed
via subtraction. Terminate token with '\0', not with garbage
that almost-always works.
* builtin.c (nondec2awknum): Assume that input string satisfies
isnondecimal(...); this simplifies the code. Add full support
for hexadecimal constants as per POSIX, if strtod supports them
as POSIX requires. Fall back on old gawk behavior (partial support
for hexadecimal constants) if strtod does not support them.
Parse constants like 010e0 and 010.0 as decimal, not octal.
* doc/gawk.texi (Nondecimal-numbers): Hexadecimal code constants
are no longer a gawk extension, since POSIX requires them now.
Warn that octal constants are treated as decimal in compatibility mode.
* test/Makefile.am (EXTRA_DIST): Add float.awk, float.ok.
(BASIC_TESTS): Add float.
* test/float.awk, test/float.ok: New files.
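
The builtin.c entry above depends on whether the host strtod accepts
hexadecimal floating-point input. Here is a minimal sketch of how that
can be probed; strtod_consumed and host_strtod_parses_hexfloat are
hypothetical names used only for illustration, not gawk functions. The
idea matches the patch's check: a strtod that does not understand the
"0x" prefix stops right after the leading "0", so only one character
is consumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return how many characters of str the host strtod consumes.
 * Hypothetical helper for illustration only.
 */
size_t
strtod_consumed(const char *str)
{
	char *endp;

	assert(str != NULL);
	(void) strtod(str, &endp);
	return (size_t) (endp - str);
}

/*
 * Return nonzero if the host strtod parses a C99-style hexadecimal
 * floating-point string in full.  The probe string is an assumption
 * chosen for this sketch: 0x1.8p1 is 1.5 * 2.
 */
int
host_strtod_parses_hexfloat(void)
{
	const char *probe = "0x1.8p1";

	return strtod_consumed(probe) == strlen(probe);
}
```

On a host where the probe succeeds, strtod("0x1.8p1", NULL) yields 3;
on a host where it fails, strtod_consumed("0x1.8p1") is 1 and a
fallback such as the old gawk hexadecimal loop is needed.
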
--- gawk-3.1.4/awk.h 2004-07-26 07:11:05 -0700
+++ gawk-3.1.4-hexfloat/awk.h 2005-01-14 16:04:12 -0800
@@ -742,7 +742,8 @@ extern char casetable[]; /* for case-ind
/* ------------------------- Pseudo-functions ------------------------- */
#define is_identchar(c) (isalnum(c) || (c) == '_')
-#define isnondecimal(str) (((str)[0]) == '0' && (ISDIGIT((str)[1]) \
+#define isnondecimal(str) (((str)[0]) == '0' \
+ && (('0' <= (str)[1] && (str)[1] <= '7') \
|| (str)[1] == 'x' || (str)[1] == 'X'))
#define var_uninitialized(n) ((n)->var_value == Nnull_string)
--- gawk-3.1.4/awkgram.y 2004-07-26 07:11:12 -0700
+++ gawk-3.1.4-hexfloat/awkgram.y 2005-01-14 21:42:13 -0800
@@ -1999,36 +1999,65 @@ retry:
/* It's a number */
for (;;) {
int gotnumber = FALSE;
+ int c1;
+ int c2;
tokadd(c);
switch (c) {
case 'x':
case 'X':
- if (do_traditional)
+ if ((do_traditional && ! do_posix)
+ || tok != tokstart + 2
+ || tokstart[0] != '0')
+ goto done;
+ c1 = nextc();
+ if (c1 == '.') {
+ c1 = nextc();
+ pushback();
+ }
+ pushback();
+ switch (c1) {
+ case '0': case '1': case '2': case '3':
+ case '4': case '5': case '6': case '7':
+ case '8': case '9':
+ case 'a': case 'A': case 'b': case 'B':
+ case 'c': case 'C': case 'D': case 'd':
+ case 'e': case 'E': case 'f': case 'F':
+ break;
+ default:
goto done;
- if (tok == tokstart + 2)
+ }
inhex = TRUE;
break;
case '.':
- if (seen_point) {
- gotnumber = TRUE;
- break;
- }
+ if (seen_point || seen_e)
+ goto done;
seen_point = TRUE;
break;
+ case 'p':
+ case 'P':
+ if (!inhex)
+ goto done;
+ goto exponent;
case 'e':
case 'E':
if (inhex)
break;
- if (seen_e) {
- gotnumber = TRUE;
- break;
+ exponent:
+ if (seen_e)
+ goto done;
+ c1 = nextc();
+ if (c1 == '-' || c1 == '+') {
+ c2 = nextc();
+ pushback();
+ } else
+ c2 = c1;
+ if (! ISDIGIT(c2)) {
+ pushback();
+ goto done;
}
+ tokadd(c1);
seen_e = TRUE;
- if ((c = nextc()) == '-' || c == '+')
- tokadd(c);
- else
- pushback();
break;
case 'a':
case 'A':
@@ -2040,7 +2069,8 @@ retry:
case 'd':
case 'f':
case 'F':
- if (do_traditional || ! inhex)
+ if ((do_traditional && ! do_posix)
+ || ! inhex || seen_e)
goto done;
/* fall through */
case '0':
@@ -2068,15 +2098,15 @@ retry:
lintwarn(_("source file does not end in newline"));
eof_warned = TRUE;
}
- tokadd('\0');
+ *--tok = '\0';
if (! do_traditional && isnondecimal(tokstart)) {
static short warned = FALSE;
- if (do_lint && ! warned) {
+ if (ISDIGIT(tokstart[1]) && do_lint && ! warned) {
warned = TRUE;
- lintwarn("numeric constant `%.*s' treated as octal or hexadecimal",
- strlen(tokstart)-1, tokstart);
+ lintwarn("numeric constant `%s' treated as octal",
+ tokstart);
}
- yylval.nodeval = make_number(nondec2awknum(tokstart, strlen(tokstart)));
+ yylval.nodeval = make_number(nondec2awknum(tokstart, tok - tokstart));
} else
yylval.nodeval = make_number(atof(tokstart));
yylval.nodeval->flags |= PERM;
--- gawk-3.1.4/builtin.c 2004-07-13 00:55:28 -0700
+++ gawk-3.1.4-hexfloat/builtin.c 2005-01-14 22:44:17 -0800
@@ -2799,12 +2799,23 @@ do_strtonum(NODE *tree)
AWKNUM
nondec2awknum(char *str, size_t len)
{
- AWKNUM retval = 0.0;
+ AWKNUM retval;
+ char *endp;
char save;
short val;
- char *start = str;
- if (*str == '0' && (str[1] == 'x' || str[1] == 'X')) {
+ save = str[len];
+ str[len] = '\0';
+ retval = strtod(str, &endp);
+ str[len] = save;
+
+ if (endp == str + 1) {
+ /*
+ * This must be a number that begins with 0x or 0X.
+ * On pre-C99 hosts, use a poor substitute for C99 strtod,
+ * which does not recognize fractions or exponents.
+ */
+
/*
* User called strtonum("0x") or some such,
* so just quit early.
@@ -2847,22 +2858,32 @@ nondec2awknum(char *str, size_t len)
}
retval = (retval * 16) + val;
}
- } else if (*str == '0') {
- for (; len > 0; len--) {
- if (! ISDIGIT(*str))
- goto done;
- else if (*str == '8' || *str == '9') {
- str = start;
- goto decimal;
- }
- retval = (retval * 8) + (*str - '0');
- str++;
- }
} else {
-decimal:
- save = str[len];
- retval = strtod(str, NULL);
- str[len] = save;
+ /* This must be a number that begins with 00 through 07. */
+ AWKNUM octalval = str[1] - '0';
+ size_t i;
+ for (i = 2; i < len; i++)
+ switch (str[i]) {
+ case '0': case '1': case '2': case '3':
+ case '4': case '5': case '6': case '7':
+ octalval = (octalval * 8) + (str[i] - '0');
+ break;
+
+ case '8': case '9': case '.':
+ return retval;
+
+ case 'e': case 'E':
+ if (i + 1 < len
+ && ISDIGIT(str[i + 1
+ + (i + 2 < len
+ && (str[i + 1] == '-'
+ || str[i + 1] == '+'))]))
+ return retval;
+ return octalval;
+ default:
+ return octalval;
+ }
+ return octalval;
}
done:
return retval;
--- gawk-3.1.4/doc/gawk.texi 2004-06-21 07:09:14 -0700
+++ gawk-3.1.4-hexfloat/doc/gawk.texi 2005-01-14 14:43:15 -0800
@@ -7417,11 +7417,10 @@ $ gawk 'BEGIN @{ print "021 is", 021 ; p
@end example
@cindex compatibility mode (@command{gawk}), octal numbers
-@cindex compatibility mode (@command{gawk}), hexadecimal numbers
-Octal and hexadecimal source code constants are a @command{gawk} extension.
+Octal source code constants are a @command{gawk} extension.
If @command{gawk} is in compatibility mode
(@pxref{Options}),
-they are not available.
+all such constants are treated as decimal numbers, as required by POSIX.
@c fakenode --- for prepinfo
@subheading Advanced Notes: A Constant's Base Does Not Affect Its Value
--- gawk-3.1.4/test/Makefile.am 2004-07-28 06:48:08 -0700
+++ gawk-3.1.4-hexfloat/test/Makefile.am 2005-01-14 22:26:45 -0800
@@ -151,6 +151,8 @@ EXTRA_DIST = \
fldchgnf.awk \
fldchgnf.in \
fldchgnf.ok \
+ float.awk \
+ float.ok \
fmttest.awk \
fmttest.ok \
fnamedat.awk \
@@ -554,7 +556,8 @@ BASIC_TESTS = addcomma anchgsub argarray
arynocls aryprm1 aryprm2 aryprm3 aryprm4 aryprm5 aryprm6 aryprm7 \
aryprm8 arysubnm asgext awkpath back89 backgsub childin clobber \
clsflnam compare compare2 concat1 concat2 concat3 convfmt datanonl defref \
- delarprm delarpm2 delfunc dynlj eofsplit exitval1 fldchg fldchgnf fmttest fnamedat \
+ delarprm delarpm2 delfunc dynlj eofsplit exitval1 fldchg fldchgnf \
+ float fmttest fnamedat \
fnarray fnarray2 fnarydel fnaryscl fnasgnm fnmisc fnparydl \
fordel forsimp fsbs fsrs fstabplus funsemnl funsmnam funstack getline \
getline2 getline3 getlnbuf getnr2tb getnr2tm gsubasgn gsubtest \
--- /dev/null 2003-03-18 13:55:57 -0800
+++ gawk-3.1.4-hexfloat/test/float.awk 2005-01-14 22:38:53 -0800
@@ -0,0 +1,27 @@
+BEGIN {
+ e = "1(e)" ; E = "1(E)"
+ e0 = "2e1(e0)" ; E0 = "2e1(E0)"
+ ex = "3e2(ex)" ; EX = "3e2(EX)"
+ p = "4e3(p)" ; P = "4e3(P)"
+ p0 = "5e4(p0)" ; P0 = "5e4(P0)"
+ x = "6e5(x)" ; X = "6e5(X)"
+ xx = "7e6(xx)" ; XX = "7e6(XX)"
+ print 0x , 0X
+ print 0xx , 0XX
+ print 0e+x , 0E+X
+ print 0e-x , 0E-X
+ print 0ex , 0EX
+ print 0e3 , 0E3
+ print 0e-3 , 0E-3
+ print 0e+3 , 0E+3
+ print 0x0e0 , 0X0E0
+ print 010e0 , 010E0
+ print 10p0 , 10P0
+ print 0x.0 , 0X.0
+ print 0x0p0 , 0X0P0
+ print 0x10.p , 0X10.P
+ print 0x10.p0 , 0X10.P0
+ print 0x10.p-0 , 0X10.P-0
+ print 0x10.p+0 , 0X10.P+0
+ print 0e9.3
+}
--- /dev/null 2003-03-18 13:55:57 -0800
+++ gawk-3.1.4-hexfloat/test/float.ok 2005-01-14 22:39:10 -0800
@@ -0,0 +1,18 @@
+06e5(x) 06e5(X)
+07e6(xx) 07e6(XX)
+0600001 0600001
+0-599999 0-599999
+03e2(ex) 03e2(EX)
+0 0
+0 0
+0 0
+224 224
+10 10
+105e4(p0) 105e4(P0)
+0 0
+0 0
+164e3(p) 164e3(P)
+16 16
+16 16
+16 16
+00.3