bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug-grep] Boyer Moore overflow patch


From: Julian Foad
Subject: Re: [bug-grep] Boyer Moore overflow patch
Date: Wed, 15 Jun 2005 01:45:10 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8b) Gecko/20050217

I'm looking back at this post from 22 November 2004:

Stepan Kasal wrote:
  some time ago I have created a patch to kwset.c, which fixes an
overflow in the Boyer-Moore algorithm.  According to our rules, I'm
posting it here for review before I check it in.

In a follow-up post, Stepan provided three test cases:

The attached tar file [grep_bm_bug.tar.gz] contains three test cases:

subdir 1/         grep -f pat txt  # shows a false match
subdir 2/         grep -f pat txt  # misses a match
subdir fgrepbug/  grep -f pat badfile   # misses a match

All these bugs work [i.e. the tests fail] with grep, fgrep and egrep--grep is 
smart enough to see
that there are no meta characters in the pattern.

The last example comes from bug report
http://lists.gnu.org/archive/html/bug-gnu-utils/2003-02/msg00105.html
so we should credit Steve Summit for it, and for reporting the bug.

Example 2/ is the same as fgrepbug, but it artificially created.

I can supply more details how the bug works, if you want.

I have incorporated the (essentially two) tests into tests/foad1.sh.

------------------------------------------------------------------------

Tue Feb 11 15:23:30 CET 2003  Stepan Kasal  <address@hidden>

        * src/kwset.c (MIN): New macro.
          (kwsprep): Use MIN to conveniently fill delta[] with initial
          values; use it again when filling delta[] with the actual values
          ---this fixes a weirdly looking overflow bug if the shift is a
          multiple of 256; use a bit different code to detrmine mind2.

I'm going to ask for a bit of a higher-level explanation in the log message. The last clause, "use a bit different code to detrmine mind2" doesn't really say anything useful. The reader asks, "Why?". "Use more efficient code to determine mind2" would be better. The central clause, about the bug fix, doesn't give enough indication for anyone but a Grep expert to know what bug is being fixed. Adding a regression test as part of the same commit would be a good way to make it clear, and I know Stepan was intending to do so, and was asking for help with that.

Now to review the patch itself:

diff -urpN grep-2.5.1/src/kwset.c grep-2.5.1.kw/src/kwset.c
--- grep-2.5.1/src/kwset.c      Thu Feb  8 16:06:36 2001
+++ grep-2.5.1.kw/src/kwset.c   Tue Feb 11 15:18:35 2003
@@ -46,6 +46,9 @@ extern char *xmalloc();
 #define obstack_chunk_alloc malloc
 #define obstack_chunk_free free
+#undef MIN
+#define MIN(A,B) ((A) < (B) ? (A) : (B))
+
 /* Balanced tree of edges and labels leaving a given trie node. */
 struct tree
 {
@@ -386,12 +389,8 @@ kwsprep (kwset_t kws)
   /* Initial values for the delta table; will be changed later.  The
      delta entry for a given character is the smallest depth of any
      node at which an outgoing edge is labeled by that character. */
-  if (kwset->mind < 256)
-    for (i = 0; i < NCHAR; ++i)
-      delta[i] = kwset->mind;
-  else
-    for (i = 0; i < NCHAR; ++i)
-      delta[i] = 255;
+  for (i = 0; i < NCHAR; ++i)
+    delta[i] = MIN(kwset->mind, UCHAR_MAX);

This change appears to be a no-op, making the source code neater but potentially slower (depending on how good the compiler's optimiser is). If we really want to make this change at the same time as fixing the bug, isn't "memset" neater and guaranteed to be fast?

    memset(delta, MIN(kwset->mind, UCHAR_MAX), NCHAR);

/* Check if we can use the simple boyer-moore algorithm, instead
      of the hairy commentz-walter algorithm. */
@@ -404,15 +403,18 @@ kwsprep (kwset_t kws)
          kwset->target[i] = curr->links->label;
          curr = curr->links->trie;
        }
-      /* Build the Boyer Moore delta.  Boy that's easy compared to CW. */
+      /* Build the Boyer-Moore delta.  Boy that's easy compared to CW.
+        Moo that's not _that_ easy; you have to avoid an overflow;
+        wonder if CW contains similar catches... */

The last line of that comment is not useful in the code.

       for (i = 0; i < kwset->mind; ++i)
-       delta[(unsigned char) kwset->target[i]] = kwset->mind - (i + 1);
[For the purpose of this review I've moved the patch line "- kwset->mind2 = kwset->mind;" down from here, where the "diff" program put it, to group the logical changes together in my email.]
+       delta[(unsigned char) kwset->target[i]] =
+         MIN(kwset->mind - (i + 1), UCHAR_MAX);

OK.  This appears to be the bug fix.

-      kwset->mind2 = kwset->mind;
       /* Find the minimal delta2 shift that we might make after
         a backwards match has failed. */
-      for (i = 0; i < kwset->mind - 1; ++i)
-       if (kwset->target[i] == kwset->target[kwset->mind - 1])
-         kwset->mind2 = kwset->mind - (i + 1);
+      i = kwset->mind - 1;
+      while (i-- && kwset->target[i] != kwset->target[kwset->mind - 1])
+       ;
+      kwset->mind2 = kwset->mind - (i + 1);

I agree that that's functionally the same but more efficient: it stops as soon as it finds a match.

It took me a long time to satisfy myself that this optimisation is correct, mainly because I didn't see the removed line "kwset->mind2 = kwset->mind;" because it was hiding in the bug-fix part of the patch. That is part of the reason why the bug fix should be written, reviewed and committed separately from the optimisations and tidying.

     }

So, I have prepared a patch, attached as "grep_bm_fix.diff", containing a log message, the fix, and regression tests, but not the optimisations/tidying.

Reviews, please.

- Julian

Attachment: grep_bm_bug.tar.gz
Description: GNU Zip compressed data

Index: ChangeLog
===================================================================
RCS file: /cvsroot/grep/grep/ChangeLog,v
retrieving revision 1.247
diff -u -3 -p -d -r1.247 ChangeLog
--- ChangeLog   14 Jun 2005 20:56:41 -0000      1.247
+++ ChangeLog   15 Jun 2005 00:39:25 -0000
@@ -1,3 +1,15 @@
+2005-06-15  Julian Foad  <address@hidden>
+
+       Fix a bug in the Boyer-Moore code in handling certain potential
+       matches over 255 bytes long, resulting in spurious or missed matches.
+       Thanks to Steve Summit for the bug report and reproduction recipe,
+       and to Stepan Kasal for the fix and test cases.
+
+       * src/kwset.c
+         (MIN): New macro.
+         (kwsprep): Ensure that "delta" values don't overflow.
+       * tests/foad1.sh: Add regression tests for this.
+
 2005-06-14  Charles Levert  <address@hidden>
 
        Fix bug #11022 (Line wrapping causes GREP_COLOR background
Index: src/kwset.c
===================================================================
RCS file: /cvsroot/grep/grep/src/kwset.c,v
retrieving revision 1.6
diff -u -3 -p -d -r1.6 kwset.c
--- src/kwset.c 2 May 2005 09:47:48 -0000       1.6
+++ src/kwset.c 15 Jun 2005 00:39:26 -0000
@@ -46,6 +46,9 @@ extern char *xmalloc();
 #define obstack_chunk_alloc malloc
 #define obstack_chunk_free free
 
+#undef MIN
+#define MIN(A,B) ((A) < (B) ? (A) : (B))
+
 /* Balanced tree of edges and labels leaving a given trie node. */
 struct tree
 {
@@ -404,9 +407,11 @@ kwsprep (kwset_t kws)
          kwset->target[i] = curr->links->label;
          curr = curr->links->trie;
        }
-      /* Build the Boyer Moore delta.  Boy that's easy compared to CW. */
+      /* Build the Boyer-Moore delta.  Boy that's easy compared to CW.
+        Moo that's not _that_ easy; you have to avoid an overflow. */
       for (i = 0; i < kwset->mind; ++i)
-       delta[(unsigned char) kwset->target[i]] = kwset->mind - (i + 1);
+       delta[(unsigned char) kwset->target[i]] =
+         MIN(kwset->mind - (i + 1), UCHAR_MAX);
       kwset->mind2 = kwset->mind;
       /* Find the minimal delta2 shift that we might make after
         a backwards match has failed. */
Index: tests/foad1.sh
===================================================================
RCS file: /cvsroot/grep/grep/tests/foad1.sh,v
retrieving revision 1.8
diff -u -3 -p -d -r1.8 foad1.sh
--- tests/foad1.sh      14 Jun 2005 20:56:42 -0000      1.8
+++ tests/foad1.sh      15 Jun 2005 00:39:26 -0000
@@ -140,4 +140,16 @@ then
 fi
 
 
+# Test for Boyer-Moore bugs.  Thanks to Steve Summit for reporting the bug
+# and providing a test case which inspired these.
+grep_test \
+  
"abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbba"
 \
+  "" \
+  
"abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
+grep_test \
+  
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxcbbbabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcxxxxxxxxxxxxxxxxxx"
 \
+  
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxcbbbabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcxxxxxxxxxxxxxxxxxx/"
 \
+  
"cbbbabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc"
+
+
 exit $failures

reply via email to

[Prev in Thread] Current Thread [Next in Thread]