bug#16481: dfa.c and Rational Range Interpretation
From: Paul Eggert
Subject: bug#16481: dfa.c and Rational Range Interpretation
Date: Fri, 17 Jan 2014 14:43:29 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0
Thanks for continuing to bird-dog this.
On 01/17/2014 05:39 AM, Aharon Robbins wrote:
> the following diff lets grep check the other awk syntax
> variants. Feel free to apply it.
I did that (the first patch enclosed below).
Thanks.
> I do think that gawk's code is the correct thing to be doing for RRI.
I agree, and installed the second patch enclosed below to
implement this. This patch also includes some documentation
changes -- if you have a bit of time to review them I'd
appreciate it.
Also, I notice that there are a few "#ifdef GREP"s in dfa.c.
Do you happen to know why they're needed? It'd be nice if
we could simplify dfa.c to omit the need for the GREP macro.
> Additionally, I recommend that grep's configure check for good RRI
> support in the system regex routines and switch to the included ones
> if the system ones don't support it.
Unfortunately that'd break support for equivalence classes
and multibyte collation symbols on GNU/Linux platforms, so
it may be a bridge too far. Until we get glibc fixed, I
think it's OK to live with the situation where [a-z]
ordinarily has the rational range interpretation, and this
breaks down only for complicated matches where the DFA
doesn't suffice; at least it'll work in the usual case.
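
For readers unfamiliar with the term: under the rational range interpretation, a range such as [a-z] denotes exactly the characters whose codes lie between the two endpoints, whatever the locale's collating sequence says. The following is only a minimal sketch of that idea, assuming single-byte characters; the names rri_fill_range and rri_count are hypothetical and do not appear in dfa.c.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { NCHARS = UCHAR_MAX + 1 };

/* Hypothetical helper, not part of dfa.c: mark every byte value from
   LO through HI (inclusive) as a member of SET.  The locale is never
   consulted; membership depends only on the numeric byte value.
   A reversed range (LO > HI) yields an empty set.  */
static void
rri_fill_range (bool set[NCHARS], unsigned char lo, unsigned char hi)
{
  assert (set != NULL);
  memset (set, 0, NCHARS * sizeof set[0]);
  for (int c = lo; c <= hi; c++)
    set[c] = true;
}

/* Hypothetical helper: count the members of SET.  */
static int
rri_count (bool const set[NCHARS])
{
  int n = 0;
  for (int c = 0; c < NCHARS; c++)
    n += set[c];
  return n;
}
```

Calling rri_fill_range (set, 'a', 'd') on an ASCII system marks only a, b, c and d, so upper-case letters are excluded regardless of the current locale.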
>From c862ced6f31f0ccdf2505ac46e354a1a011149cd Mon Sep 17 00:00:00 2001
From: Aharon Robbins <address@hidden>
Date: Fri, 17 Jan 2014 12:42:49 -0800
Subject: [PATCH 1/2] grep: add undocumented '-X gawk' and '-X posixawk'
options
See <http://bugs.gnu.org/16481>.
* src/grep.c (GAcompile, PAcompile): New functions.
(const): Use them.
---
src/grep.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/src/grep.c b/src/grep.c
index 1b2198f..12644a2 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -19,10 +19,24 @@ Acompile (char const *pattern, size_t size)
GEAcompile (pattern, size, RE_SYNTAX_AWK);
}
+static void
+GAcompile (char const *pattern, size_t size)
+{
+ GEAcompile (pattern, size, RE_SYNTAX_GNU_AWK);
+}
+
+static void
+PAcompile (char const *pattern, size_t size)
+{
+ GEAcompile (pattern, size, RE_SYNTAX_POSIX_AWK);
+}
+
struct matcher const matchers[] = {
{ "grep", Gcompile, EGexecute },
{ "egrep", Ecompile, EGexecute },
{ "awk", Acompile, EGexecute },
+ { "gawk", GAcompile, EGexecute },
+ { "posixawk", PAcompile, EGexecute },
{ "fgrep", Fcompile, Fexecute },
{ "perl", Pcompile, Pexecute },
{ NULL, NULL, NULL },
--
1.8.4.2
>From aba2c718908d6c8fcfd75d55a43a4c9b1e3405a3 Mon Sep 17 00:00:00 2001
From: Paul Eggert <address@hidden>
Date: Fri, 17 Jan 2014 14:32:10 -0800
Subject: [PATCH 2/2] grep: DFA now uses rational ranges in unibyte locales
Problem reported by Aharon Robbins in <http://bugs.gnu.org/16481>.
* NEWS:
* doc/grep.texi (Environment Variables)
(Character Classes and Bracket Expressions):
Document this.
* src/dfa.c (parse_bracket_exp): Treat unibyte locales like multibyte.
---
NEWS | 8 ++++++++
doc/grep.texi | 19 +++++++++----------
src/dfa.c | 20 ++------------------
3 files changed, 19 insertions(+), 28 deletions(-)
diff --git a/NEWS b/NEWS
index 6e46684..589b2ac 100644
--- a/NEWS
+++ b/NEWS
@@ -7,6 +7,14 @@ GNU grep NEWS -*- outline -*-
grep -i in a multibyte locale is now typically 10 times faster
for patterns that do not contain \ or [.
+ Range expressions in unibyte locales now ordinarily use the rational
+ range interpretation, in which [a-z] matches only lower-case ASCII
+ letters regardless of locale, and similarly for other ranges. (This
+ was already true for multibyte locales.) Portable programs should
+ continue to specify the C locale when using range expressions, since
+ these expressions have unspecified behavior in non-GNU systems and
+ are not yet guaranteed to use the rational range interpretation even
+ in GNU systems.
* Noteworthy changes in release 2.16 (2014-01-01) [stable]
diff --git a/doc/grep.texi b/doc/grep.texi
index 473a181..42fb9a2 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -960,8 +960,8 @@ They are omitted (i.e., false) by default and become true when specified.
@cindex national language support
@cindex NLS
These variables specify the locale for the @code{LC_COLLATE} category,
-which determines the collating sequence
-used to interpret range expressions like @samp{[a-z]}.
+which might affect how range expressions like @samp{[a-z]} are
+interpreted.
@item LC_ALL
@itemx LC_CTYPE
@@ -1223,14 +1223,13 @@ For example, the regular expression
Within a bracket expression, a @dfn{range expression} consists of two
characters separated by a hyphen.
It matches any single character that
-sorts between the two characters, inclusive, using the locale's
-collating sequence and character set.
-For example, in the default C
-locale, @samp{[a-d]} is equivalent to @samp{[abcd]}.
-Many locales sort
-characters in dictionary order, and in these locales @samp{[a-d]} is
-typically not equivalent to @samp{[abcd]};
-it might be equivalent to @samp{[aBbCcDd]}, for example.
+sorts between the two characters, inclusive.
+In the default C locale, the sorting sequence is the native character
+order; for example, @samp{[a-d]} is equivalent to @samp{[abcd]}.
+In other locales, the sorting sequence is not specified, and
+@samp{[a-d]} might be equivalent to @samp{[abcd]} or to
+@samp{[aBbCcDd]}, or it might fail to match any character, or the set of
+characters that it matches might even be erratic.
To obtain the traditional interpretation
of bracket expressions, you can use the @samp{C} locale by setting the
@env{LC_ALL} environment variable to the value @samp{C}.
diff --git a/src/dfa.c b/src/dfa.c
index 6ab4e05..5e3140d 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1108,30 +1108,14 @@ parse_bracket_exp (void)
}
else
{
- /* Defer to the system regex library about the meaning
- of range expressions. */
- regex_t re;
- char pattern[6] = { '[', 0, '-', 0, ']', 0 };
- char subject[2] = { 0, 0 };
c1 = c;
if (case_fold)
{
c1 = tolower (c1);
c2 = tolower (c2);
}
-
- pattern[1] = c1;
- pattern[3] = c2;
- regcomp (&re, pattern, REG_NOSUB);
- for (c = 0; c < NOTCHAR; ++c)
- {
- if ((case_fold && isupper (c)))
- continue;
- subject[0] = c;
- if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
- setbit_case_fold_c (c, ccl);
- }
- regfree (&re);
+ for (c = c1; c <= c2; c++)
+ setbit_case_fold_c (c, ccl);
}
colon_warning_state |= 8;
--
1.8.4.2
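
The code removed from parse_bracket_exp above asked the system regex library, one byte at a time, which characters a range matches. The same probing idea could serve as a run-time check of whether the system regex routines follow the rational range interpretation in the current locale, along the lines of the configure check suggested earlier in the thread. The sketch below is an assumption about how such a check might look, not code from grep or gawk. The names probe_range and range_is_rational are hypothetical, only the POSIX regcomp, regexec and regfree calls are real, and the endpoints are assumed to be ordinary characters such as letters (not ']', '^' or '-').

```c
#include <assert.h>
#include <limits.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { NCHARS = UCHAR_MAX + 1 };

/* Hypothetical helper: record in SET which single bytes the system
   regex library matches against the bracket expression [LO-HI] in
   the current locale.  Byte 0 is skipped because it cannot appear in
   a NUL-terminated subject string.  Returns false if regcomp rejects
   the pattern.  */
static bool
probe_range (bool set[NCHARS], char lo, char hi)
{
  char pattern[] = { '[', lo, '-', hi, ']', '\0' };
  regex_t re;

  assert (set != NULL);
  memset (set, 0, NCHARS * sizeof set[0]);
  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return false;
  for (int c = 1; c < NCHARS; c++)
    {
      char subject[] = { (char) c, '\0' };
      set[c] = regexec (&re, subject, 0, NULL, 0) == 0;
    }
  regfree (&re);
  return true;
}

/* Hypothetical helper: true if the system library's idea of [LO-HI]
   is exactly the bytes whose values lie between LO and HI.  */
static bool
range_is_rational (char lo, char hi)
{
  bool set[NCHARS];
  if (!probe_range (set, lo, hi))
    return false;
  for (int c = 1; c < NCHARS; c++)
    if (set[c] != ((unsigned char) lo <= c && c <= (unsigned char) hi))
      return false;
  return true;
}
```

A program that never calls setlocale runs in the C locale, where range_is_rational ('a', 'd') is expected to hold. After setlocale (LC_ALL, "") in a locale with dictionary-order collation the result depends on the system library, which is the situation the NEWS entry warns about.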