bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [patch #7134] Patch for is_mb_middle in searchutil.c


From: Jim Meyering
Subject: Re: [patch #7134] Patch for is_mb_middle in searchutil.c
Date: Sun, 28 Mar 2010 19:01:04 +0200

Jim Meyering wrote:
> Jim Meyering wrote:
>> Norihirio Tanaka wrote:
>>> URL:
>>>   <http://savannah.gnu.org/patch/?7134>
>>>
>>>                  Summary: Patch for is_mb_middle in searchutil.c
>>>                  Project: grep
>>>             Submitted by: noritnk
>>>             Submitted on: 2010年03月26日 13時00分00秒
>>>                 Category: None
>>>                 Priority: 5 - Normal
>>>                   Status: None
>>>                  Privacy: Public
>>>              Assigned to: None
>>>         Originator Email:
>>>              Open/Closed: Open
>>>          Discussion Lock: Any
>>>
>>> Details:
>>>
>>> I seem is_mb_middle has two bugs.
>>> This patch (merged) will be corrected both their bugs.
>>>
>>> 1. fgrep (2.6.1) hangs on a pattern with invalid sequence.
>>>
>>> ==> See attachment `fgrep-hang'.
>>>
>>> 2. A complete multibyte string matches with incomplate
>>> multibyte pattern with truncated octet character in fgrep (2.6.1).
>>>
>>> ==> See attachment `truncated-character.
>>
>> Thank you for the patch and test cases!
>>
>> I've adjusted your test scripts slightly, hooked them
>> up via tests/Makefile.am and added most of the commit logs.
>> I confirm that each test exercises a bug that is fixed by the patch.
>> I'll study your patch over the weekend.
>> Hmm... or maybe not.  Continue reading...
>
> I could not risk studying your patch (don't want to have to wait for
> the copyright assignment process), so wrote my own.  Since I've had to
> modify the test case (with your patch, grep printed the line, with mine
> it does not), I gather that our solutions are fundamentally different.
> In addition, I've changed the test to timeout after 10 seconds when run
> against a version of grep with the flaw.
>
>
>>From 4767f2f87399987afed8787de5cd4973358f76cd Mon Sep 17 00:00:00 2001
> From: Jim Meyering <address@hidden>
> Date: Sun, 28 Mar 2010 15:33:10 +0200
> Subject: [PATCH 1/2] grep -F: avoid infinite loop when searching for 
> incomplete MB character
>
> Searching for an incomplete non-prefix of a multi-byte character
> should find no match.
>
>   Just as these print nothing,
>     printf '\357\274\241\357\274\241\n' \
>       | perl -CIO -ne '/\241\357/ and print'
>     printf '\357\274\241\n' | perl -CIO -ne '/\274\241/ and print'
>     printf '\357\274\241\n' | perl -CIO -ne '/\241/ and print'
>     printf '\357\274\241\n' | perl -CIO -ne '/\274/ and print'
>
>   These should also print nothing, but with grep-2.6 and grep-2.6.1,
>   they would infloop:
>     printf '\357\274\241\n' | LC_ALL=en_US.UTF-8 src/grep -F $'\241'
>     printf '\357\274\241\n' | LC_ALL=en_US.UTF-8 src/grep -F $'\274'
>     printf '\357\274\241\n' | LC_ALL=en_US.UTF-8 src/grep -F $'\274\241'
>
> * src/kwsearch.c (Fexecute): Don't infloop when searching for
> an incomplete non-prefix part of a multi-byte character.
> * NEWS (Bug fixes): Mention it.
> Reported and diagnosed by Norihiro Tanaka.

That was for the fgrep-hang/infloop problem.
This is for the false match on prefix one.
What this means is that in a multi-byte locale,
including an incomplete multibyte character in the search
string ensures there can be no match.  As such, it's probably
better to diagnose that up front and exit right away than to
discover it for every file processed.  But even if that makes
sense, it should wait until after the bug-fix releases are done.



>From 692a042d32ed14f44fa753a9e9069f84519634bf Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Sun, 28 Mar 2010 15:37:58 +0200
Subject: [PATCH 1/2] grep -F: fix a multi-byte erroneous-match-in-middle bug

Just as Perl prints nothing in this case,
  printf '\357\274\241\n' | perl -CIO -lne '/\357/ and print'

grep should also print nothing when used as follows.
However, these would mistakenly match with grep prior to 2.6.2:
  printf '\357\274\241\n' | LC_ALL=en_US.UTF-8 src/grep -F $'\357'
  printf '\357\274\241\n' | LC_ALL=en_US.UTF-8 src/grep -F $'\357\274'

* src/searchutils.c (is_mb_middle): New parameter: the length of the
match, in bytes, as determined by kwsexec.  Use this to detect when
the nominal match found by kwsexec must be skipped because it is for
an incomplete multi-byte character that is a prefix of a character
in the input.
* src/dfasearch.c (EGexecute): Update caller.
* src/kwsearch.c (Fexecute): Likewise.
* src/search.h: Update prototype.
* NEWS (Bug fixes): Mention it.
Report and analysis by Norihiro Tanaka.
---
 NEWS              |    4 ++++
 src/dfasearch.c   |    4 +++-
 src/kwsearch.c    |    5 +++--
 src/search.h      |    2 +-
 src/searchutils.c |    5 +++--
 5 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/NEWS b/NEWS
index 2e8ef15..ddb3b49 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@ GNU grep NEWS                                    -*- outline 
-*-

 ** Bug fixes

+  grep -F no longer mistakenly reports a match when searching
+  for an incomplete prefix of a multibyte character.
+  [bug present since "the beginning"]
+
   grep -F no longer goes into an infinite loop when it finds a match for an
   incomplete (non-prefix of a) multibyte character.  [bug introduced in 2.6]

diff --git a/src/dfasearch.c b/src/dfasearch.c
index 5e7234a..5de40b6 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -251,7 +251,9 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
 #ifdef MBS_SUPPORT
                   if (mb_start < beg)
                     mb_start = beg;
-                  if (MB_CUR_MAX == 1 || !is_mb_middle (&mb_start, match, 
buflim))
+                  if (MB_CUR_MAX == 1
+                     || !is_mb_middle (&mb_start, match, buflim,
+                                       kwsm.size[0]))
 #endif
                     goto success;
                 }
diff --git a/src/kwsearch.c b/src/kwsearch.c
index 85a28bc..a20c3a8 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -103,9 +103,11 @@ Fexecute (char const *buf, size_t size, size_t *match_size,
       size_t offset = kwsexec (kwset, beg, buf + size - beg, &kwsmatch);
       if (offset == (size_t) -1)
        goto failure;
+      len = kwsmatch.size[0];
 #ifdef MBS_SUPPORT
       char const *s0 = mb_start;
-      if (MB_CUR_MAX > 1 && is_mb_middle (&mb_start, beg + offset, buf + size))
+      if (MB_CUR_MAX > 1 && is_mb_middle (&mb_start, beg + offset, buf + size,
+                                         len))
         {
          if (mb_start == s0)
            goto failure;
@@ -114,7 +116,6 @@ Fexecute (char const *buf, size_t size, size_t *match_size,
         }
 #endif /* MBS_SUPPORT */
       beg += offset;
-      len = kwsmatch.size[0];
       if (start_ptr && !match_words)
        goto success_in_beg_and_len;
       if (match_lines)
diff --git a/src/search.h b/src/search.h
index 10e4d5c..e9049a9 100644
--- a/src/search.h
+++ b/src/search.h
@@ -42,7 +42,7 @@ void kwsinit (kwset_t *);

 #ifdef MBS_SUPPORT
 char * mbtolower (const char *, size_t *);
-bool is_mb_middle(const char **, const char *, const char *);
+bool is_mb_middle(const char **, const char *, const char *, size_t);
 #endif

 /* dfasearch.c */
diff --git a/src/searchutils.c b/src/searchutils.c
index e30355d..8c34e31 100644
--- a/src/searchutils.c
+++ b/src/searchutils.c
@@ -114,7 +114,8 @@ mbtolower (const char *beg, size_t *n)


 bool
-is_mb_middle(const char **good, const char *buf, const char *end)
+is_mb_middle (const char **good, const char *buf, const char *end,
+             size_t match_len)
 {
   const char *p = *good;
   const char *prev = p;
@@ -141,6 +142,6 @@ is_mb_middle(const char **good, const char *buf, const char 
*end)
     }

   *good = prev;
-  return p > buf;
+  return p > buf || match_len < mbrlen (p, end - p, &cur_state);
 }
 #endif /* MBS_SUPPORT */
--
1.7.0.3.448.g82eeb


>From 3611e91db2e6a35c2fa55b0f52d3ec18b30411b1 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <address@hidden>
Date: Sun, 28 Mar 2010 18:42:42 +0200
Subject: [PATCH 2/2] tests: exercise fix for improper match of incomplete MB 
char prefix

* tests/prefix-of-multibyte: New file.
* tests/Makefile.am (TESTS): Add it.
---
 tests/Makefile.am         |    1 +
 tests/prefix-of-multibyte |   19 +++++++++++++++++++
 2 files changed, 20 insertions(+), 0 deletions(-)
 create mode 100644 tests/prefix-of-multibyte

diff --git a/tests/Makefile.am b/tests/Makefile.am
index 9e0224f..3caeb78 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -44,6 +44,7 @@ TESTS =                                               \
   spencer1.sh                                  \
   spencer1-locale                              \
   status.sh                                    \
+  prefix-of-multibyte                          \
   warning.sh                                   \
   word-multi-file                              \
   yesno.sh
diff --git a/tests/prefix-of-multibyte b/tests/prefix-of-multibyte
new file mode 100644
index 0000000..f9e42bf
--- /dev/null
+++ b/tests/prefix-of-multibyte
@@ -0,0 +1,19 @@
+#!/bin/sh
+# This would mistakenly print a line prior to grep-2.6.2.
+: ${srcdir=.}
+. "$srcdir/init.sh"; path_prepend_ ../src
+
+encode() { echo "$1" | tr ABC '\357\274\241'; }
+
+fail=0
+
+for LOC in en_US.UTF-8 $LOCALE_FR_UTF8; do
+  for opt in '' '-F'; do
+    out=out-$opt-$LOC
+    encode ABC | LC_ALL=$LOC grep $opt "$(encode A)" > $out 2>&1
+    test $? = 1 || fail=1
+    compare $out /dev/null || fail=1
+  done
+done
+
+Exit $fail
--
1.7.0.3.448.g82eeb




reply via email to

[Prev in Thread] Current Thread [Next in Thread]