bug-grep

Re: dfa - gawk matching problem on windows and suggested fix


From: Jim Meyering
Subject: Re: dfa - gawk matching problem on windows and suggested fix
Date: Tue, 04 Oct 2011 08:30:02 +0200

Eli Zaretskii wrote:
>> From: Jim Meyering <address@hidden>
>> Cc: address@hidden,  address@hidden
>> Date: Mon, 03 Oct 2011 18:41:25 +0200
>>
>> > This version of wctob solves the problem.
>>
>> Good.  Thanks for confirming that.
>> Then I suggest that users of dfa.c like gawk arrange to use that.
>> grep and any users that (by use of gnulib) can be assured of a working
>> wctob do not need to change dfa.c to work around that bug.
>>
>> However, while current wctob configure-time tests in gnulib
>> do detect some wctob problems, I don't see a test for this one.
>> Hence, if you can confirm that this also causes a problem with grep,
>> I'll work with you to add a configure-time test in gnulib
>> so that gnulib-using projects also replace that system's wctob.
>
> It will take time for me to look in grep, because I'd need to build my
> own binary from sources.
>
> For Gawk, the configure-time test is not going to solve the problem on
> Windows because the Windows port of Gawk does not use the configure
> script, it is built using a separately maintained Makefile.  So for
> Gawk, I can simply put the replacement wctob on a Windows-specific
> file (which exists anyway, for other functions that need wrappers or
> replacements).

FYI, this is what I'm going to push.
The only piece lacking is the [...] note in NEWS where I
normally document in which version the bug was introduced.
Since I have been unable to reproduce it, I haven't bothered
to try to deduce when it was introduced.


From 7d20c09e3e7cf3af9060f395e884fca285ce3598 Mon Sep 17 00:00:00 2001
From: Eli Zaretskii <address@hidden>
Date: Sun, 2 Oct 2011 21:33:53 +0200
Subject: [PATCH] dfa: don't mishandle high-bit bytes in a regexp with
 signed-char

This appears to arise only on systems for which "char" is signed.
* src/dfa.c (FETCH_WC, FETCH): Produce an unsigned value, rather
than a sign-extended one.  Fixes a bug on MS-Windows with compiling
patterns that include characters with the 8-th bit set.
(to_uchar): Define.  From coreutils.
Reported by David Millis <address@hidden>.
See http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3893
* NEWS (Bug fixes): Mention it.
---
 NEWS      |    5 +++++
 src/dfa.c |    9 +++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 8578e82..2b06af4 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,11 @@ GNU grep NEWS                                    -*- outline -*-

 * Noteworthy changes in release ?.? (????-??-??) [?]

+** Bug fixes
+
+  grep no longer mishandles high-bit-set pattern bytes on systems
+  where "char" is a signed type. [bug appears to affect only MS-Windows]
+
   grep now rejects a command like "grep -r pattern . > out",
   in which the output file is also one of the inputs,
   because it can result in an "infinite" disk-filling loop.
diff --git a/src/dfa.c b/src/dfa.c
index 8611435..dc87915 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -86,6 +86,11 @@
 /* Sets of unsigned characters are stored as bit vectors in arrays of ints. */
 typedef int charclass[CHARCLASS_INTS];

+/* Convert a possibly-signed character to an unsigned character.  This is
+   a bit safer than casting to unsigned char, since it catches some type
+   errors that the cast doesn't.  */
+static inline unsigned char to_uchar (char ch) { return ch; }
+
 /* Sometimes characters can only be matched depending on the surrounding
    context.  Such context decisions depend on what the previous character
    was, and the value of the current (lookahead) character.  Context
@@ -686,7 +691,7 @@ static unsigned char const *buf_end;        /* reference to end in dfaexec().  */
           {                                    \
             cur_mb_len = 1;                    \
             --lexleft;                         \
-            (wc) = (c) = (unsigned char) *lexptr++; \
+            (wc) = (c) = to_uchar (*lexptr++);  \
           }                                    \
         else                                   \
           {                                    \
@@ -715,7 +720,7 @@ static unsigned char const *buf_end;        /* reference to end in dfaexec().  */
         else                         \
           return lasttok = END;              \
       }                                      \
-    (c) = (unsigned char) *lexptr++;  \
+    (c) = to_uchar (*lexptr++);       \
     --lexleft;                       \
   } while(0)

--
1.7.7.rc0.362.g5a14
