[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [bug-grep] Boyer Moore overflow patch
From: |
Julian Foad |
Subject: |
Re: [bug-grep] Boyer Moore overflow patch |
Date: |
Wed, 15 Jun 2005 01:45:10 +0100 |
User-agent: |
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8b) Gecko/20050217 |
I'm looking back at this post from 22 November 2004:
Stepan Kasal wrote:
some time ago I have created a patch to kwset.c, which fixes an
overflow in the Boyer-Moore algorithm. According to our rules, I'm
posting it here for review before I check it in.
In a follow-up post, Stepan provided three test cases:
The attached tar file [grep_bm_bug.tar.gz] contains three test cases:
subdir 1/ grep -f pat txt # shows a false match
subdir 2/ grep -f pat txt # misses a match
subdir fgrepbug/ grep -f pat badfile # misses a match
All these bugs work [i.e. the tests fail] with grep, fgrep and egrep--grep is
smart enough to see
that there are no meta characters in the pattern.
The last example comes from bug report
http://lists.gnu.org/archive/html/bug-gnu-utils/2003-02/msg00105.html
so we should credit Steve Summit for it, and for reporting the bug.
Example 2/ is the same as fgrepbug, but it artificially created.
I can supply more details how the bug works, if you want.
I have incorporated the (essentially two) tests into tests/foad1.sh.
------------------------------------------------------------------------
Tue Feb 11 15:23:30 CET 2003 Stepan Kasal <address@hidden>
* src/kwset.c (MIN): New macro.
(kwsprep): Use MIN to conveniently fill delta[] with initial
values; use it again when filling delta[] with the actual values
---this fixes a weirdly looking overflow bug if the shift is a
multiple of 256; use a bit different code to detrmine mind2.
I'm going to ask for a bit of a higher-level explanation in the log message.
The last clause, "use a bit different code to detrmine mind2" doesn't really
say anything useful. The reader asks, "Why?". "Use more efficient code to
determine mind2" would be better. The central clause, about the bug fix,
doesn't give enough indication for anyone but a Grep expert to know what bug is
being fixed. Adding a regression test as part of the same commit would be a
good way to make it clear, and I know Stepan was intending to do so, and was
asking for help with that.
Now to review the patch itself:
diff -urpN grep-2.5.1/src/kwset.c grep-2.5.1.kw/src/kwset.c
--- grep-2.5.1/src/kwset.c Thu Feb 8 16:06:36 2001
+++ grep-2.5.1.kw/src/kwset.c Tue Feb 11 15:18:35 2003
@@ -46,6 +46,9 @@ extern char *xmalloc();
#define obstack_chunk_alloc malloc
#define obstack_chunk_free free
+#undef MIN
+#define MIN(A,B) ((A) < (B) ? (A) : (B))
+
/* Balanced tree of edges and labels leaving a given trie node. */
struct tree
{
@@ -386,12 +389,8 @@ kwsprep (kwset_t kws)
/* Initial values for the delta table; will be changed later. The
delta entry for a given character is the smallest depth of any
node at which an outgoing edge is labeled by that character. */
- if (kwset->mind < 256)
- for (i = 0; i < NCHAR; ++i)
- delta[i] = kwset->mind;
- else
- for (i = 0; i < NCHAR; ++i)
- delta[i] = 255;
+ for (i = 0; i < NCHAR; ++i)
+ delta[i] = MIN(kwset->mind, UCHAR_MAX);
This change appears to be a no-op, making the source code neater but
potentially slower (depending on how good the compiler's optimiser is). If we
really want to make this change at the same time as fixing the bug, isn't
"memset" neater and guaranteed to be fast?
memset(delta, MIN(kwset->mind, UCHAR_MAX), NCHAR);
/* Check if we can use the simple boyer-moore algorithm, instead
of the hairy commentz-walter algorithm. */
@@ -404,15 +403,18 @@ kwsprep (kwset_t kws)
kwset->target[i] = curr->links->label;
curr = curr->links->trie;
}
- /* Build the Boyer Moore delta. Boy that's easy compared to CW. */
+ /* Build the Boyer-Moore delta. Boy that's easy compared to CW.
+ Moo that's not _that_ easy; you have to avoid an overflow;
+ wonder if CW contains similar catches... */
The last line of that comment is not useful in the code.
for (i = 0; i < kwset->mind; ++i)
- delta[(unsigned char) kwset->target[i]] = kwset->mind - (i + 1);
[For the purpose of this review I've moved the patch line "- kwset->mind2 =
kwset->mind;" down from here, where the "diff" program put it, to group the
logical changes together in my email.]
+ delta[(unsigned char) kwset->target[i]] =
+ MIN(kwset->mind - (i + 1), UCHAR_MAX);
OK. This appears to be the bug fix.
- kwset->mind2 = kwset->mind;
/* Find the minimal delta2 shift that we might make after
a backwards match has failed. */
- for (i = 0; i < kwset->mind - 1; ++i)
- if (kwset->target[i] == kwset->target[kwset->mind - 1])
- kwset->mind2 = kwset->mind - (i + 1);
+ i = kwset->mind - 1;
+ while (i-- && kwset->target[i] != kwset->target[kwset->mind - 1])
+ ;
+ kwset->mind2 = kwset->mind - (i + 1);
I agree that that's functionally the same but more efficient: it stops as soon
as it finds a match.
It took me a long time to satisfy myself that this optimisation is correct,
mainly because I didn't see the removed line "kwset->mind2 = kwset->mind;"
because it was hiding in the bug-fix part of the patch. That is part of the
reason why the bug fix should be written, reviewed and committed separately
from the optimisations and tidying.
}
So, I have prepared a patch, attached as "grep_bm_fix.diff", containing a log
message, the fix, and regression tests, but not the optimisations/tidying.
Reviews, please.
- Julian
grep_bm_bug.tar.gz
Description: GNU Zip compressed data
Index: ChangeLog
===================================================================
RCS file: /cvsroot/grep/grep/ChangeLog,v
retrieving revision 1.247
diff -u -3 -p -d -r1.247 ChangeLog
--- ChangeLog 14 Jun 2005 20:56:41 -0000 1.247
+++ ChangeLog 15 Jun 2005 00:39:25 -0000
@@ -1,3 +1,15 @@
+2005-06-15 Julian Foad <address@hidden>
+
+ Fix a bug in the Boyer-Moore code in handling certain potential
+ matches over 255 bytes long, resulting in spurious or missed matches.
+ Thanks to Steve Summit for the bug report and reproduction recipe,
+ and to Stepan Kasal for the fix and test cases.
+
+ * src/kwset.c
+ (MIN): New macro.
+ (kwsprep): Ensure that "delta" values don't overflow.
+ * tests/foad1.sh: Add regression tests for this.
+
2005-06-14 Charles Levert <address@hidden>
Fix bug #11022 (Line wrapping causes GREP_COLOR background
Index: src/kwset.c
===================================================================
RCS file: /cvsroot/grep/grep/src/kwset.c,v
retrieving revision 1.6
diff -u -3 -p -d -r1.6 kwset.c
--- src/kwset.c 2 May 2005 09:47:48 -0000 1.6
+++ src/kwset.c 15 Jun 2005 00:39:26 -0000
@@ -46,6 +46,9 @@ extern char *xmalloc();
#define obstack_chunk_alloc malloc
#define obstack_chunk_free free
+#undef MIN
+#define MIN(A,B) ((A) < (B) ? (A) : (B))
+
/* Balanced tree of edges and labels leaving a given trie node. */
struct tree
{
@@ -404,9 +407,11 @@ kwsprep (kwset_t kws)
kwset->target[i] = curr->links->label;
curr = curr->links->trie;
}
- /* Build the Boyer Moore delta. Boy that's easy compared to CW. */
+ /* Build the Boyer-Moore delta. Boy that's easy compared to CW.
+ Moo that's not _that_ easy; you have to avoid an overflow. */
for (i = 0; i < kwset->mind; ++i)
- delta[(unsigned char) kwset->target[i]] = kwset->mind - (i + 1);
+ delta[(unsigned char) kwset->target[i]] =
+ MIN(kwset->mind - (i + 1), UCHAR_MAX);
kwset->mind2 = kwset->mind;
/* Find the minimal delta2 shift that we might make after
a backwards match has failed. */
Index: tests/foad1.sh
===================================================================
RCS file: /cvsroot/grep/grep/tests/foad1.sh,v
retrieving revision 1.8
diff -u -3 -p -d -r1.8 foad1.sh
--- tests/foad1.sh 14 Jun 2005 20:56:42 -0000 1.8
+++ tests/foad1.sh 15 Jun 2005 00:39:26 -0000
@@ -140,4 +140,16 @@ then
fi
+# Test for Boyer-Moore bugs. Thanks to Steve Summit for reporting the bug
+# and providing a test case which inspired these.
+grep_test \
+
"abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbba"
\
+ "" \
+
"abbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
+grep_test \
+
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxcbbbabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcxxxxxxxxxxxxxxxxxx"
\
+
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxcbbbabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcxxxxxxxxxxxxxxxxxx/"
\
+
"cbbbabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbc"
+
+
exit $failures
- Re: [bug-grep] Boyer Moore overflow patch,
Julian Foad <=