sed-devel



From: Assaf Gordon
Subject: bracket expansions and "rational range" (was: bug#25048: --with-included-regex vs. e-acute...)
Date: Tue, 29 Nov 2016 23:32:10 -0500

Hello Eric, Jim, Arnold,

[changing mailing list to sed-devel@ from discussion in 
https://debbugs.gnu.org/25048 ]

Regarding this:

> On Nov 28, 2016, at 11:53, Eric Blake <address@hidden> wrote:
> 
> On 11/27/2016 10:57 PM, Jim Meyering wrote:
>> When grep is configured --with-included-regex, the following command
>> fails to print the expected match:
>> 
>>   printf '\351\n' |LC_ALL=fr_FR.iso88591 src/grep '[d-f]'
[...]

> We SHOULD be adjusting more and more GNU tools to honor rational range
> behavior, at least as an option, even if that means that e-acute can
> never be matched to [d-f].

I'm working on improving the sed manual,
and have just copied some parts from the grep manual.

Specifically, this is about the section on bracket expressions:
https://www.gnu.org/software/grep/manual/grep.html#Character-Classes-and-Bracket-Expressions

> In other locales, the sorting sequence is not specified, and ‘[a-d]’ might
> be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to match any
> character, or the set of characters that it matches might even be erratic.
> To obtain the traditional interpretation of bracket expressions, you can
> use the ‘C’ locale by setting the LC_ALL environment variable to the
> value ‘C’.
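
To make the last sentence of that quote concrete, here is a minimal sketch
of the traditional interpretation under the C locale. The sample input and
the variable name c_locale_matches are made up for illustration; only the
LC_ALL=C setting comes from the quoted text.

```shell
# With LC_ALL=C, [a-d] is the byte range a..d, so it matches the
# lowercase letters a, b, c, d and nothing else.
c_locale_matches=$(printf 'a\nB\nc\nD\ne\n' | LC_ALL=C sed -n '/[a-d]/p')

# Prints the lines "a" and "c"; "B", "D" and "e" are outside the range.
printf '%s\n' "$c_locale_matches"
```

In any other locale the same command may print a different set of lines,
which is exactly the ambiguity the quoted paragraph warns about.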

Would you recommend rephrasing it, perhaps mentioning "Rational Range
Interpretation"?

I should probably compile a list of the os/libc/locale/gnulib combinations
under which sed does not follow rational range interpretation. The addition
of the DFA engine (with fallback to the previous engine) makes things even
more confusing (for me, at least).

For example, I see the following on Debian (latest sed from git):

    $ printf '\351\n' | LC_ALL=fr_FR.iso88591 sed -n '/[d-f]/p' | od -tx1
    0000000 e9 0a
    0000002
    $ printf '\u00e9\n' | LC_ALL=en_US.utf8 sed -n '/[d-f]/p' | od -tx1
    0000000 c3 a9 0a
    0000003

The same sed from git on Mac OS X, however, does not match:

    $ gprintf '\351\n' | LC_ALL=fr_FR.ISO8859-1 ./sed/sed -n '/[d-f]/p' | od -tx1
    0000000
    $ gprintf '\u00e9\n' | LC_ALL=fr_FR.utf8 ./sed/sed -n '/[d-f]/p' | od -tx1
    0000000

IIUC, that's because on Debian sed uses glibc's "re_search", while on Mac OS
it uses gnulib's "_rpl_re_search".
Should we perhaps change it to always use gnulib's and get "rational range",
at the cost of backward incompatibility?
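
For the list of combinations mentioned earlier, a small probe along these
lines might be a starting point. This is only a sketch: probe_range is a
hypothetical helper name, and it tests just the single byte 0xE9 used in
the examples above. Note that if the requested locale is not installed,
sed silently falls back to the C locale, so a "no match" result should be
checked against the output of `locale -a`.

```shell
# Hypothetical helper: report whether the byte 0xE9 (e-acute in
# ISO-8859-1) is matched by [d-f] under a given locale.
#   $1 = locale name
#   $2 = sed binary to test (optional, defaults to "sed")
probe_range() {
    # Count output bytes instead of grepping the result, because the
    # raw byte 0xE9 is not a valid character in UTF-8 locales.
    matched_bytes=$(printf '\351\n' \
        | LC_ALL="$1" "${2:-sed}" -n '/[d-f]/p' | wc -c | tr -d ' ')
    if [ "$matched_bytes" -gt 0 ]; then
        echo "$1: matches"
    else
        echo "$1: no match"
    fi
}

# The C locale is the baseline: 0xE9 lies outside the byte range d..f.
probe_range C
```

Running it as `probe_range fr_FR.iso88591` or
`probe_range fr_FR.ISO8859-1 ./sed/sed` would repeat the two
single-byte tests above; the UTF-8 cases would need a second helper that
feeds the two-byte sequence c3 a9 instead.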


Comments welcome,
thanks,
 - assaf




