bug-grep
Re: Dealing with character ranges in grep


From: Paolo Bonzini
Subject: Re: Dealing with character ranges in grep
Date: Thu, 09 Jun 2011 10:47:22 +0200

[making this public, there should be no reason not to]

On 06/08/2011 10:14 PM, Aharon Robbins wrote:
> Hi.  As we've discussed a little previously, I finally got tired of
> trying to explain to users why the character range [a-z] was matching
> most uppercase letters also.  ("I've found a bug in gawk! [a-z] matches 'C' !"
> "No - it's a POSIX locale issue".)  This had to be the most F of the FAQs.
>
> So, for the upcoming gawk 4.0, I decided (as Karl put it) to cut the
> Gordian knot and make ranges behave like the C locale, the way it's long
> been documented, and as most people expect.  Those who want the POSIX
> behavior can still get it using --posix.
>
> So I went back and just made the fix in the dfa and regex code, by
> introducing a new syntax bit, RE_RANGES_IGNORE_LOCALES, which is turned
> on in RE_SYNTAX_GNU_AWK and added to RE_SYNTAX_AWK for gawk's --traditional
> option.  (This turned out to be easier than I'd feared it would be.)
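
[Illustration: a minimal sketch of why a range such as [a-z] can match 'C' when range endpoints are compared by collation position rather than by character code. The collation sequence below is a made-up, dictionary-style ordering and is not taken from any real locale; the names toy_order, toy_rank, in_range_collation and in_range_codepoint are hypothetical and do not appear in gawk, grep or regex.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* HYPOTHETICAL collation sequence, for illustration only: lowercase and
   uppercase letters interleaved, as a dictionary-style ordering might do.
   It is not copied from any real locale definition. */
const char toy_order[] =
    "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";

/* Position of c in the toy collation sequence, or -1 if c is not in it. */
int toy_rank(char c)
{
    const char *p = (c != '\0') ? strchr(toy_order, c) : NULL;
    return p ? (int)(p - toy_order) : -1;
}

/* Range test by collation position: lo <= c <= hi within toy_order. */
bool in_range_collation(char c, char lo, char hi)
{
    int rc = toy_rank(c), rlo = toy_rank(lo), rhi = toy_rank(hi);
    assert(rlo >= 0 && rhi >= 0);   /* both endpoints must be in the table */
    return rc >= 0 && rlo <= rc && rc <= rhi;
}

/* Range test by character code, which is how ranges behave in the C locale. */
bool in_range_codepoint(char c, char lo, char hi)
{
    return (unsigned char)lo <= (unsigned char)c
        && (unsigned char)c <= (unsigned char)hi;
}
```

[With this toy ordering, in_range_collation('C', 'a', 'z') is true because 'C' sorts between 'a' and 'z', while 'Z' sorts after 'z' and falls outside the range; that is the "most uppercase letters" effect described above. in_range_codepoint('C', 'a', 'z') is false in ASCII.]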

Actually, it should be even easier, for two reasons.

First reason: using wcscoll is quite broken, even more so than collation element ordering (CEO). Besides, we should be in control of the non-_LIBC cases, so we should submit a patch to glibc (or patch gnulib locally) that makes the behavior of your RE_RANGES_IGNORE_LOCALES the sole possibility when _LIBC is not defined.
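
[Illustration: a rough sketch of the two kinds of range test being contrasted here, one comparing wchar_t values directly and one going through wcscoll. This is not the actual regex or dfa.c code; the function names are hypothetical, and only wcscoll itself is a real library call.]

```c
#include <stdbool.h>
#include <wchar.h>

/* Range test by wchar_t value: the result does not depend on LC_COLLATE. */
bool wc_in_range_by_value(wchar_t c, wchar_t lo, wchar_t hi)
{
    return lo <= c && c <= hi;
}

/* Range test through wcscoll(): the result depends on whatever
   LC_COLLATE is in effect when the function is called. */
bool wc_in_range_by_wcscoll(wchar_t c, wchar_t lo, wchar_t hi)
{
    /* wcscoll() compares strings, so wrap each character in a
       one-character, NUL-terminated wide string. */
    wchar_t cs[2]  = { c,  L'\0' };
    wchar_t los[2] = { lo, L'\0' };
    wchar_t his[2] = { hi, L'\0' };
    return wcscoll(los, cs) <= 0 && wcscoll(cs, his) <= 0;
}
```

[The first function gives the same answer for L'C' against the range a-z in every locale; the second may or may not, depending on the collation rules of the current locale.]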

Second reason: nowadays, dfa.c always punts on parsing multibyte bracket expressions and defers to regex. The code that handles mbcsets is only there in case someone is using dfaexec with a NULL backref argument, but we might as well remove it, and I wouldn't complain at all. So dfa.c does not need any special casing of RE_RANGES_IGNORE_LOCALES either. Instead, hard_LC_COLLATE should be removed as a premature optimization.

The important point is to realize that you cannot fix the whole problem: --without-included-regex will forever yield glibc's CEO (you cannot help that, and if distros choose to use it you will still get bogus bug reports), while the default choice of --with-included-regex will give wchar_t ordering. The above solution takes this into account, and within this constraint it provides a much cleaner result:

1) no need for POSIXLY_CORRECT (which would be an abuse of POSIXLY_CORRECT actually... where's POSIX_ME_HARDER when you need it? ;)

2) as a result of (1), no need to say anything in the documentation (and anything you say would likely be incorrect in the --without-included-regex case);

3) no need for extra flags and changes to the regex clients;

4) no need to care about consistency between dfa.c and regex definitions;

5) instant applicability of the solution to all GNU packages just by upgrading gnulib or importing a new version of regex.

So, unlike before, you sold me on this, *provided the above plan is implemented*. The difference is that this approach, I think, does not cause more headaches than it solves. Hopefully it will not cause any new headaches, assuming we can reasonably synchronize releases of gawk, grep and sed!

Would it be too much to ask to hold gawk 4.0 until the above plan is realized? It's strictly about gawk/grep/gnulib; no need to involve glibc from the beginning. Even better, would anyone help with the work while I'm on vacation (from Saturday till the 26th of June)?

Paolo
