bug-gawk

Re: [bug-gawk] 4.0.0 Regex Patterns Choke on Exotic Chars


From: Eli Zaretskii
Subject: Re: [bug-gawk] 4.0.0 Regex Patterns Choke on Exotic Chars
Date: Fri, 30 Sep 2011 16:33:35 +0300

> Date: Mon, 12 Sep 2011 07:19:10 GMT
> From: address@hidden
> Cc: address@hidden, address@hidden
> 
> Otherwise, it looks like a problem with compiling the regular expression.
> Start with make_regexp and keep digging down.  You may want to try
> compiling without optimization; I've seen the regex code break optimizers
> before.

No, optimizations have nothing to do with this (I see the problem in a
non-optimized build as well).

This bug is caused by the most mundane and dull of issues: mixing
signed and unsigned values.  To tell the truth, I never expected to
see such issues in GNU sources that have been in use for so long.

Here's the thing.  The fatal error comes from here:

  regexp();

  if (tok != END)
    dfaerror(_("unbalanced )"));

I.e., dfaparse expects the whole string to have been consumed by the
time `regexp' returns.  In `regexp' we see:

  static void
  regexp (void)
  {
    branch();
    while (tok == OR)
      {
        tok = lex();
        branch();
        addtok(OR);
      }
  }

where `branch' does this:

  static void
  branch (void)
  {
    closure();
    while (tok != RPAREN && tok != OR && tok >= 0)
      {
        closure();
        addtok(CAT);
      }
  }

Note that `branch' terminates the loop when `tok' is negative (and
there are other subroutines of dfa.c that do the same).  Now, `tok'
has an enumerated data type with a single negative value:

  typedef enum
  {
    END = -1,

    /* Ordinary character values are terminal symbols that match themselves. */

    EMPTY = NOTCHAR,            /* EMPTY is a terminal symbol that matches
    ...

NOTCHAR is 256.  So obviously, `branch' assumes that `tok' will only
be negative when its value is END.  However, `lex' calls the FETCH_WC
and FETCH macros, which on Windows return negative values for any
character greater than 127.  So the loop ends prematurely with part of
the pattern still unparsed, `tok' is not END when `regexp' returns,
and dfaparse reports the bogus "unbalanced )" error.
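
As a standalone illustration of this failure mode (a minimal sketch,
not code from dfa.c), the function below runs a `branch'-style loop
over a byte string.  The names `token', `TOK_END', `TOK_OR',
`TOK_RPAREN' and `consumed' are hypothetical, and the signed read is
forced with an explicit `signed char' cast so that the behavior does
not depend on whether plain `char' is signed on a given platform.

```c
#include <stddef.h>

/* Hypothetical token type modelled on the enum quoted above:
   TOK_END is the only value that is meant to be negative.  */
typedef enum { TOK_END = -1, TOK_OR = 257, TOK_RPAREN = 258 } token;

/* Return how many bytes of S a branch-style loop consumes before it
   stops.  With SIGN_EXTEND nonzero, each byte is read through
   `signed char', so bytes above 127 become negative; otherwise each
   byte is read through `unsigned char' and stays in 0..255.  */
static size_t
consumed (const char *s, size_t len, int sign_extend)
{
  size_t i = 0;

  while (i < len)
    {
      int tok = sign_extend ? (int) (signed char) s[i]
                            : (int) (unsigned char) s[i];

      /* Same exit test as in `branch': any negative value is
         taken to mean "end of input".  */
      if (tok == TOK_RPAREN || tok == TOK_OR || tok < 0)
        break;
      i++;
    }
  return i;
}
```

For the three-byte string "a\x95z", the sign-extending variant stops
at the second byte, while the unsigned variant consumes all three.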

Why do we get negative values from FETCH_WC and FETCH?  Because they
assume that casting to an unsigned type converts a negative value to a
positive one.  But what happens in fact is sign extension, so instead
of 0x95 we get 0xffffff95.  Assigning this to a signed int (because
the return value of `lex' has the same enumerated type mentioned
above, which must be signed to accommodate -1) converts it back to a
negative value.
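
The arithmetic can be checked in isolation.  The following is a
sketch of sign extension in general and does not reproduce the macro
expansions in dfa.c; `fetch_signed' and `fetch_unsigned' are
hypothetical helpers, and the 0xffffff95 figure assumes a 32-bit int.

```c
/* Read one byte as a signed quantity.  A byte such as 0x95 is
   sign-extended when widened to int and comes out as -107, whose
   32-bit unsigned representation is 0xffffff95.  */
static int
fetch_signed (const char *p)
{
  return (signed char) *p;
}

/* Read one byte through `unsigned char' before widening to int.
   The result is always in the range 0..255, so 0x95 stays 0x95.  */
static int
fetch_unsigned (const char *p)
{
  return (unsigned char) *p;
}
```

Which of the two behaviors a given expression exhibits depends on the
type the byte has at the moment it is widened, which is why an
intermediate `unsigned char' variable makes the intent explicit.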

I can fix the problem with the following simple patch.  I don't
consider myself an expert on futzing with signed and unsigned values,
so I'll leave it to the experts to figure out The Right Way if this
one isn't.  I did test the patch on GNU/Linux and verified that
David's script works there after applying the patch below.

2011-09-30  Eli Zaretskii  <address@hidden>

        * dfa.c (FETCH_WC, FETCH): Produce an unsigned value, rather than
        a sign-extended one.  Fixes a bug on MS-Windows with compiling
        patterns that include characters with the 8-th bit set.
        Reported by David Millis <address@hidden>.

--- dfa.c.orig  2011-06-23 12:27:01.000000000 +0300
+++ dfa.c       2011-09-30 16:06:25.609375000 +0300
@@ -691,19 +691,22 @@ static unsigned char const *buf_end;      /* 
     else                                       \
       {                                                \
         wchar_t _wc;                           \
+        unsigned char uc;                      \
         cur_mb_len = mbrtowc(&_wc, lexptr, lexleft, &mbs); \
         if (cur_mb_len <= 0)                   \
           {                                    \
             cur_mb_len = 1;                    \
             --lexleft;                         \
-            (wc) = (c) = (unsigned char) *lexptr++; \
+            uc = (unsigned char) *lexptr++;    \
+           (wc) = (c) = uc;                    \
           }                                    \
         else                                   \
           {                                    \
             lexptr += cur_mb_len;              \
             lexleft -= cur_mb_len;             \
             (wc) = _wc;                                \
-            (c) = wctob(wc);                   \
+            uc = (unsigned) wctob(wc);         \
+            (c) = uc;                          \
           }                                    \
       }                                                \
   } while(0)
@@ -718,6 +721,7 @@ static unsigned char const *buf_end;        /* 
 /* Note that characters become unsigned here. */
 # define FETCH(c, eoferr)            \
   do {                               \
+    unsigned char uc;                \
     if (! lexleft)                   \
       {                                      \
         if ((eoferr) != 0)           \
@@ -725,7 +729,8 @@ static unsigned char const *buf_end;        /* 
         else                         \
           return lasttok = END;              \
       }                                      \
-    (c) = (unsigned char) *lexptr++;  \
+    uc = (unsigned char) *lexptr++;   \
+    (c) = uc;                        \
     --lexleft;                       \
   } while(0)
 


