From: Jim Meyering
Subject: Re: [bug #33198] Incorrect bracket expression when parsing in ru_RU.KOI8-R (Russian locale)
Date: Thu, 02 Jun 2011 23:08:39 +0200
Santiago Ruano Rincón wrote:
> Follow-up Comment #3, bug #33198 (project grep):
> It seems the problem is still unsolved. I've tried both, 2.8 and patching 2.7,
> but I got the same results. Igor Ladygin confirms this.
>
> address@hidden:~$ echo Пример| LC_ALL=ru_RU.KOI8-R grep -qE "[Пп]";
> echo $?
> 1
Thank you.
At first I was going to say this:
You are using ru_RU.KOI8-R, which is a uni-byte locale, yet your
inputs (both stdin and the grep regexp) use the two-byte representation,
П (\xd0\x9f), instead of the uni-byte П (\360).
But it fails even with the single-byte version.
So it is indeed a bug in grep, but at least this time
it affects relatively few locales.
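
For reference, the two encodings of П mentioned above can be inspected from the shell. This is only a minimal sketch and not part of the patch: the helper names hex_bytes, utf8_pe and koi8r_pe are made up for illustration, and octal printf escapes are used on the assumption that \x escapes may not be available in every shell.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not part of grep or its tests).

# Print the bytes read from stdin as space-separated hex values.
hex_bytes() {
  # Word-split od's output so that its padding and line breaks do not matter.
  set -- $(od -An -tx1)
  echo "$*"
}

# U+041F (П) encoded as UTF-8: two bytes, 0xd0 0x9f.
utf8_pe() { printf '\320\237'; }

# U+041F (П) encoded as KOI8-R: one byte, 0xf0 (octal 360).
koi8r_pe() { printf '\360'; }

echo "UTF-8:  $(utf8_pe | hex_bytes)"
echo "KOI8-R: $(koi8r_pe | hex_bytes)"
```

Feeding the output of koi8r_pe to grep under LC_ALL=ru_RU.KOI8-R is the single-byte variant of the reported command.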
Here's the fix I expect to use and a test case to exercise it.
From 8e214a2ecc4bac7f8341deb3646b6f1c3819dac3 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Thu, 2 Jun 2011 18:03:49 +0200
Subject: [PATCH 1/2] fix the range bug also for relatively unusual uni-byte
encodings
* src/dfa.c (setbit_case_fold) Bug fix. FIXME
* NEWS (Bug fixes): Mention it.
---
NEWS | 4 ++++
src/dfa.c | 7 +++++--
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/NEWS b/NEWS
index 312c803..67b3fad 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@ GNU grep NEWS -*- outline -*-
** Bug fixes
+ echo c|grep '[c]' would fail for any c in 0x80..0xff, with a uni-byte
+ encoding for which the byte-to-wide-char mapping is nontrivial. For
+ example, the ISO-8859-1 locales are not affected, but ru_RU.KOI8-R is.
+
grep -P no longer aborts when PCRE's backtracking limit is exceeded
Before, echo aaaaaaaaaaaaaab |grep -P '((a+)*)+$' would abort. Now,
it diagnoses the problem and exits with status 2.
diff --git a/src/dfa.c b/src/dfa.c
index b41cbb6..0ce6242 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -573,8 +573,11 @@ setbit_case_fold (
   else
     {
 #if MBS_SUPPORT
-      int b2 = wctob ((unsigned char) b);
-      if (b2 == EOF || b2 == b)
+      /* Below, note how when b2 != b and we have a uni-byte locale
+         (MB_CUR_MAX == 1), we set b = b2. I.e., in a uni-byte locale,
+         we can safely call setbit with a non-EOF value returned by wctob. */
+      int b2 = wctob (b);
+      if (b2 == EOF || b2 == b || (MB_CUR_MAX == 1 ? (b=b2), 1 : 0))
 #endif
         setbit (b, c);
     }
--
1.7.6.rc0.254.gf37de
From c93e621ac20d085abda4cf3c269f5cf902671a84 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Thu, 2 Jun 2011 11:01:35 +0200
Subject: [PATCH 2/2] tests: exercise a non-UTF8 multi-byte range bug:
requires ru_RU.KOI8-R
* tests/mb-non-utf8-range: New file.
* tests/Makefile.am (TESTS): Add it.
* init.cfg (require_ru_RU_koi8_r): New function.
---
tests/Makefile.am | 1 +
tests/init.cfg | 9 +++++++++
tests/mb-non-utf8-range | 41 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+), 0 deletions(-)
create mode 100644 tests/mb-non-utf8-range
diff --git a/tests/Makefile.am b/tests/Makefile.am
index a01b004..2d0527a 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -63,6 +63,7 @@ TESTS = \
inconsistent-range \
khadafy \
max-count-vs-context \
+ mb-non-utf8-range \
high-bit-range \
options \
pcre \
diff --git a/tests/init.cfg b/tests/init.cfg
index 3429f0d..f6ead9c 100644
--- a/tests/init.cfg
+++ b/tests/init.cfg
@@ -69,3 +69,12 @@ require_en_utf8_locale_()
*) skip_test_ 'en_US.UTF-8 locale not found' ;;
esac
}
+
+require_ru_RU_koi8_r()
+{
+  path_prepend_ .
+  case $(get-mb-cur-max ru_RU.KOI8-R) in
+    1) ;;
+    *) skip_test_ 'ru_RU.KOI8-R locale not found' ;;
+  esac
+}
diff --git a/tests/mb-non-utf8-range b/tests/mb-non-utf8-range
new file mode 100644
index 0000000..a0b51dd
--- /dev/null
+++ b/tests/mb-non-utf8-range
@@ -0,0 +1,41 @@
+#!/bin/sh
+# Exercise a DFA range bug that arises only with a unibyte encoding
+# for which the wide-char-to-single-byte mapping is nontrivial.
+# E.g., the regexp, [C] would fail to match C in a unibyte locale like
+# ru_RU.KOI8-R for any C whose wide-char representation differed from
+# its single-byte equivalent.
+
+# Copyright (C) 2011 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_ru_RU_koi8_r
+LC_ALL=ru_RU.KOI8-R
+export LC_ALL
+
+fail=0
+
+for i in 8 9 a b c d e f; do
+  for j in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
+    in=in-$i$j
+    b=$(printf "\\x$i$j")
+    echo "$b" > $in || framework_failure_
+    cp $in /t
+    grep "[$b]" $in > out || fail=1
+    compare out $in || fail=1
+  done
+done
+
+Exit $fail
--
1.7.6.rc0.254.gf37de
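
To spot-check a single byte by hand, outside the test framework, something like the following sketch may help. It assumes a POSIX shell and a grep on PATH; bracket_matches_byte is a hypothetical helper, not something the patch adds. If the named locale is not installed, grep will typically end up running in the C locale, so a "match" result only says something about this bug when ru_RU.KOI8-R is actually available.

```shell
#!/bin/sh
# bracket_matches_byte LOCALE OCTAL   (hypothetical helper)
# Succeed if grep, run under LOCALE, matches the single byte whose octal
# value is OCTAL against a bracket expression containing just that byte.
bracket_matches_byte() {
  loc=$1
  # Build the byte from its octal value, e.g. 360 -> 0xf0.
  b=$(printf "\\$2")
  printf '%s\n' "$b" | LC_ALL=$loc grep -q "[$b]"
}

# Example: П is byte 0360 in KOI8-R.
if bracket_matches_byte ru_RU.KOI8-R 360; then
  echo "match"
else
  echo "no match"
fi
```

The helper is meant for bytes in the 0x80..0xff range (octal 200 through 377); bytes that are special inside a bracket expression, such as ] or ^, would need extra care.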