bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug #33198] Incorrect bracket expression when parsing in ru_RU.KOI8


From: Jim Meyering
Subject: Re: [bug #33198] Incorrect bracket expression when parsing in ru_RU.KOI8-R (Russian locale)
Date: Thu, 02 Jun 2011 23:08:39 +0200

Santiago Ruano Rincón wrote:
> Follow-up Comment #3, bug #33198 (project grep):
> It seems the problem is still unsolved. I've tried both, 2.8 and patching 2.7,
> but I got the same results. Igor Ladygin confirms this.
>
> address@hidden:~$ echo Пример| LC_ALL=ru_RU.KOI8-R grep -qE "[Пп]";
> echo $?
> 1

Thank you.

At first I was going to say this:

  You are using ru_RU.KOI8-R, which is a uni-byte locale, yet your
  inputs (both stdin and the grep regexp) use the two-byte representation,
  П (\xd0\9f), instead of the uni-byte П (\360).

But it fails even with the single-byte version.
So it is indeed a bug in grep, but at least this time
it affects relatively few locales.

Here's the fix I expect to use and a test case to exercise it.

>From 8e214a2ecc4bac7f8341deb3646b6f1c3819dac3 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Thu, 2 Jun 2011 18:03:49 +0200
Subject: [PATCH 1/2] fix the range bug also for relatively unusual uni-byte
 encodings

* src/dfa.c (setbit_case_fold) Bug fix. FIXME
* NEWS (Bug fixes): Mention it.
---
 NEWS      |    4 ++++
 src/dfa.c |    7 +++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 312c803..67b3fad 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@ GNU grep NEWS                                    -*- outline 
-*-

 ** Bug fixes

+  echo c|grep '[c]' would fail for any c in 0x80..0xff, with a uni-byte
+  encoding for which the byte-to-wide-char mapping is nontrivial.  For
+  example, the ISO-88591 locales are not affected, but ru_RU.KOI8-R is.
+
   grep -P no longer aborts when PCRE's backtracking limit is exceeded
   Before, echo aaaaaaaaaaaaaab |grep -P '((a+)*)+$' would abort.  Now,
   it diagnoses the problem and exits with status 2.
diff --git a/src/dfa.c b/src/dfa.c
index b41cbb6..0ce6242 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -573,8 +573,11 @@ setbit_case_fold (
   else
     {
 #if MBS_SUPPORT
-      int b2 = wctob ((unsigned char) b);
-      if (b2 == EOF || b2 == b)
+      /* Below, note how when b2 != b and we have a uni-byte locale
+         (MB_CUR_MAX == 1), we set b = b2.  I.e., in a uni-byte locale,
+         we can safely call setbit with a non-EOF value returned by wctob.  */
+      int b2 = wctob (b);
+      if (b2 == EOF || b2 == b || (MB_CUR_MAX == 1 ? (b=b2), 1 : 0))
 #endif
         setbit (b, c);
     }
--
1.7.6.rc0.254.gf37de


>From c93e621ac20d085abda4cf3c269f5cf902671a84 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Thu, 2 Jun 2011 11:01:35 +0200
Subject: [PATCH 2/2] tests: exercise a non-UTF8 multi-byte range bug:
 requires ru_RU.KOI8-R

* tests/mb-non-utf8-range: New file.
* tests/Makefile.am (TESTS): Add it.
* init.cfg (require_ru_RU_koi8_r): New function.
---
 tests/Makefile.am       |    1 +
 tests/init.cfg          |    9 +++++++++
 tests/mb-non-utf8-range |   41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 51 insertions(+), 0 deletions(-)
 create mode 100644 tests/mb-non-utf8-range

diff --git a/tests/Makefile.am b/tests/Makefile.am
index a01b004..2d0527a 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -63,6 +63,7 @@ TESTS =                                               \
   inconsistent-range                            \
   khadafy                                      \
   max-count-vs-context                         \
+  mb-non-utf8-range                            \
   high-bit-range                               \
   options                                      \
   pcre                                         \
diff --git a/tests/init.cfg b/tests/init.cfg
index 3429f0d..f6ead9c 100644
--- a/tests/init.cfg
+++ b/tests/init.cfg
@@ -69,3 +69,12 @@ require_en_utf8_locale_()
     *) skip_test_ 'en_US.UTF-8 locale not found' ;;
   esac
 }
+
+require_ru_RU_koi8_r()
+{
+  path_prepend_ .
+  case $(get-mb-cur-max ru_RU.KOI8-R) in
+    1) ;;
+    *) skip_test_ 'ru_RU.KOI8-R locale not found' ;;
+  esac
+}
diff --git a/tests/mb-non-utf8-range b/tests/mb-non-utf8-range
new file mode 100644
index 0000000..a0b51dd
--- /dev/null
+++ b/tests/mb-non-utf8-range
@@ -0,0 +1,41 @@
+#!/bin/sh
+# Exercise a DFA range bug that arises only with a unibyte encoding
+# for which the wide-char-to-single-byte mapping is nontrivial.
+# E.g., the regexp, [C] would fail to match C in a unibyte locale like
+# ru_RU.KOI8-R for any C whose wide-char representation differed from
+# its single-byte equivalent.
+
+# Copyright (C) 2011 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_ru_RU_koi8_r
+LC_ALL=ru_RU.KOI8-R
+export LC_ALL
+
+fail=0
+
+for i in 8 9 a b c d e f; do
+  for j in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
+    in=in-$i$j
+    b=$(printf "\\x$i$j")
+    echo "$b" > $in || framework_failure_
+    cp $in /t
+    grep "[$b]" $in > out || fail=1
+    compare out $in || fail=1
+  done
+done
+
+Exit $fail
--
1.7.6.rc0.254.gf37de



reply via email to

[Prev in Thread] Current Thread [Next in Thread]