Re: Dealing with character ranges in grep

bug-grep

From:	Paolo Bonzini
Subject:	Re: Dealing with character ranges in grep
Date:	Thu, 09 Jun 2011 10:47:22 +0200
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110428 Fedora/3.1.10-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.10

[making this public, there should be no reason not to]

On 06/08/2011 10:14 PM, Aharon Robbins wrote:

Hi.  As we've discussed a little previously, I finally got tired of
trying to explain to users why the character range [a-z] was matching
most uppercase letters also.  ("I've found a bug in gawk! [a-z] matches 'C' !"
"No - it's a POSIX locale issue".)  This had to be the most F of the FAQs.

So, for the upcoming gawk 4.0, I decided (as Karl put it) to cut the
Gordian knot and make ranges behave like the C locale, the way it's long
been documented, and as most people expect.  Those who want the POSIX
behavior can still get it using --posix.

So I went back and just made the fix in the dfa and regex code, by
introducing a new syntax bit, RE_RANGES_IGNORE_LOCALES, which is turned
on in RE_SYNTAX_GNU_AWK and added to RE_SYNTAX_AWK for gawk's --traditional
option.  (This turned out to be easier than I'd feared it would be.)
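
To make the quoted complaint concrete, here is a minimal Python sketch of why [a-z] can match 'C'. It assumes a hypothetical dictionary-style collation sequence aAbBcC...zZ; real locale tables are more elaborate, and the helper names below are made up for this illustration and are not part of gawk, grep or regex.

```python
import string

# ASSUMPTION: a simplified dictionary-style collation order, aAbBcC...zZ.
# Real locale collation tables are more complex; this is only an illustration.
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)
RANK = {ch: i for i, ch in enumerate(COLLATION)}


def in_range_codepoint(ch, lo, hi):
    """C-locale style range test: compare raw code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_collation(ch, lo, hi):
    """Locale style range test: compare positions in the collation sequence."""
    if ch not in RANK:
        return False
    return RANK[lo] <= RANK[ch] <= RANK[hi]


# Which uppercase letters fall inside the range a-z under each rule?
upper_matched_c = [c for c in string.ascii_uppercase if in_range_codepoint(c, "a", "z")]
upper_matched_loc = [c for c in string.ascii_uppercase if in_range_collation(c, "a", "z")]

print(upper_matched_c)              # no uppercase letters
print("".join(upper_matched_loc))   # every uppercase letter except Z
```

Under the assumed ordering, 'Z' sorts after 'z', so it is the only uppercase letter left out of the range, which is consistent with "most uppercase letters" in the report above.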


Actually, it should be even easier, for two reasons.

First reason: using wcscoll is quite broken, even more so than collation equivalent ordering. Besides, we should be in control of the non-_LIBC cases, so we should submit a patch to glibc (or patch gnulib locally) that makes your RE_RANGES_IGNORE_LOCALES the sole possibility when _LIBC is not defined.

Second reason: nowadays, dfa.c always punts on parsing of multibyte bracketed expressions and defers to regex. The code that handles mbcsets is there in case someone is using dfaexec with a NULL backref argument, but we might as well remove it and I wouldn't complain at all. So, dfa.c also does not need any special casing of RE_RANGES_IGNORE_LOCALES. Instead, hard_LC_COLLATE should be removed as a premature optimization.

The important point is to realize that you cannot fix the whole problem: --without-included-regex will forever yield glibc's CEO (you cannot help that, and if distros choose to use it you will still get bogus bug reports), while the default choice of --with-included-regex will give wchar_t ordering. The above solution takes this into account, and within this constraint it provides a much cleaner result:

1) no need for POSIXLY_CORRECT (which would be an abuse of POSIXLY_CORRECT actually... where's POSIX_ME_HARDER when you need it? ;)

2) as a result of (1), no need to say anything in the documentation (and, anything you say would likely be incorrect in the --without-included-regex case);


3) no need for extra flags and changes to the regex clients;

4) no need to care about consistency between dfa.c and regex definitions;

5) instant applicability of the solution to all GNU packages just by upgrading gnulib or importing a new version of regex.
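
As a rough illustration of what code-point ("wchar_t") ordering means for ranges, here is a small Python check. Python's `re` module is used only as a stand-in that happens to compare code points in bracket ranges; it is not gnulib's regex, and the `matches` helper is made up for this sketch.

```python
import re


def matches(pattern, ch):
    """Return True if the single character ch is matched by the bracket pattern."""
    return re.fullmatch(pattern, ch) is not None


# Ranges are evaluated on code points, independent of any locale collation.
print(matches("[a-z]", "c"))            # lowercase ASCII letter inside a-z
print(matches("[a-z]", "C"))            # uppercase letter: U+0043 is below U+0061
print(matches("[a-z]", "\u00e9"))       # e-acute: U+00E9 is above U+007A
print(matches("[\u00e0-\u00ff]", "\u00e9"))  # U+00E0 <= U+00E9 <= U+00FF
```

With this rule, [a-z] covers exactly the 26 lowercase ASCII letters, and accented letters have to be listed or ranged explicitly.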

So, unlike before, you sold me on this, *provided the above plan is implemented*. The difference is that this approach, I think, does not cause more headaches than it solves. Hopefully, it will not provide any new headache assuming we can synchronize decently a release of gawk, grep and sed!

Would it be too much to ask to hold gawk 4.0 until the above plan is realized? It's strictly about gawk/grep/gnulib; no need to involve glibc from the beginning. Even better, would anyone help with the work while I'm on vacation (from Saturday till the 26th of June)?


Paolo
