proposal: make [A-Z] range handling locale-independent
From: Jim Meyering
Subject: proposal: make [A-Z] range handling locale-independent
Date: Thu, 16 Jun 2011 11:56:31 +0200
Jim Meyering wrote:
> Jim Meyering wrote:
>> Bruno Haible wrote:
>>> Paolo,
>>>
>>>> > [=e=] to match "e" as well as accented versions like é, è and ê).
>>>> > That is the one feature that you get with glibc, and that you would
>>>> > sacrifice when building --with-included-regex.
>>>>
>>>> I agree. It's up to distros to choose, of course.
>>>
>>> If you are on the point of sacrificing a glibc feature in many programs,
>>> then IMO you should first talk with the glibc people to see what alternative
>>> they can offer.
>>
>> People who build the tools currently have the choice of using
>> --with-included-regex or
>> --without-included-regex
>>
>> Note that putting equivalence classes (and backrefs) aside, the
>> interpretation of ranges is done in dfa.c, which means the vast
>> majority of range uses never even require use of regexp code.
>>
>> However, backreferences force these tools to skip the DFA-based
>> optimization and resort to running the regexp code. In that case,
>> there is a dichotomy. Adding a backreference to a range-including
>> regexp would have the surprising consequence of changing how that range
>> is interpreted when the tool is built to use glibc's regexp code.
>>
>> Thus, if we go this route, we are effectively saying
>> that people who want self-consistent regex-handling
>> in our tools must build with --with-included-regex or end
>> up causing subtle problems.
>>
>> That's a big leap.
>> I'm not saying I won't take upstream grep over the edge,
>> but I'd like to hear what a few distro-maintainers think.
>
> To clarify...
> I like Arnold's proposal to make regex range handling sane
> and locale-independent.
To be precise, this was proposed by Arnold Robbins and Karl Berry.
> It goes like this (at least for gawk, grep and sed):
>
>   - change how dfa.c interprets ranges like [a-z]
>   - change how gnulib's reg* code handles ranges
>   - always use the included regex code (the one from gnulib),
>     so that its interpretation is consistent with that of dfa.c.
>
> Grep's current upstream default is to build --without-included-regex,
> which makes grep use glibc's regex code.
>
> To make this proposed change go through, that configure-time option would
> have to be eliminated, so that we always build with the gnulib-provided
> regex code. Of course, if glibc ever changes, we can detect that and
> automatically prefer it when possible.
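
For anyone who wants to check whether the two code paths agree in their own build, here is a minimal sketch. It assumes, as described above, that a backreference pushes grep off the DFA and onto the regex code. The function name range_paths_agree is made up for illustration, and the locale you pass must be installed for the result to mean anything.

```shell
# Hypothetical helper: succeed if, in locale $1, grep gives the same answer
# for the letter "b" against [A-Z] with and without a backreference.
# Assumption: the backreference forces the regex matcher instead of the DFA.
range_paths_agree() {
    loc=$1
    # Plain range: expected to be handled by the DFA.
    if echo b | LC_ALL="$loc" grep -q '[A-Z]'; then dfa=match; else dfa=nomatch; fi
    # Same range plus a backreference: "bb" matches only if "b" is in [A-Z].
    if echo bb | LC_ALL="$loc" grep -q '\([A-Z]\)\1'; then re=match; else re=nomatch; fi
    echo "dfa=$dfa regex=$re"
    [ "$dfa" = "$re" ]
}

range_paths_agree C    # prints "dfa=nomatch regex=nomatch"
```

Running it with a locale such as cs_CZ in place of C shows whether a given build is self-consistent there; a differing pair is the dichotomy discussed above.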
Considering a wider audience, an example will help illustrate
what we want to (or dare I say "will" ;-) change.
In some locales, the [A-Z] regexp currently matches 25 of the
lower case letters. For example,
    $ echo b| LC_ALL=cs_CZ grep '[A-Z]'
    b
    $ echo y| LC_ALL=cs_CZ grep '[A-Z]'
    y
That is obviously undesirable, and this proposal is to make those
commands always print nothing, regardless of which locale you use.
I.e., they'll act like this:
    $ echo y| LC_ALL=C grep '[A-Z]'
    $
I think few will object.
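
In the meantime, forcing the C locale already gives the byte-order interpretation for ASCII ranges. A small sketch of that, with a helper name (matches_c_range) that is purely illustrative:

```shell
# Hypothetical helper: succeed if the single character $1 matches [A-Z]
# when grep is forced into the C locale, where ranges follow byte order.
matches_c_range() {
    printf '%s\n' "$1" | LC_ALL=C grep -q '[A-Z]'
}

for ch in y b Z; do
    if matches_c_range "$ch"; then
        echo "$ch: match"
    else
        echo "$ch: no match"
    fi
done
# Prints:
#   y: no match
#   b: no match
#   Z: match
```

Note that LC_ALL=C also changes how non-ASCII input is treated, so this is a workaround for ASCII ranges rather than a substitute for the proposed change.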
Run the following command to see the names of locales installed on
your system that make grep exhibit this surprising behavior:
    for i in $(locale -a); do echo b | LC_ALL=$i /bin/grep -q '[A-Z]' && echo "$i"; done
On Fedora 15, I see 62 such locales.
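
If you would rather get a count directly, here is a slightly more defensive variant of that loop, offered as a sketch only. It reads locale names line by line, uses whichever grep comes first in PATH instead of /bin/grep, and silently prints nothing when no locale command is available. The function name affected_locales is made up for illustration.

```shell
# Hypothetical helper: print each installed locale in which the lower-case
# letter "b" matches [A-Z].
affected_locales() {
    # No locale(1) on this system: nothing to report.
    command -v locale >/dev/null 2>&1 || return 0
    locale -a 2>/dev/null | while IFS= read -r loc; do
        # grep reads "b" from the inner pipe, not from the locale list.
        if echo b | LC_ALL="$loc" grep -q '[A-Z]' 2>/dev/null; then
            printf '%s\n' "$loc"
        fi
    done
}

# Number of affected locales; the result depends on the system and on the
# grep and libc versions in use.
affected_locales | wc -l
```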