bug-gnu-utils

Re: Case insensitivity seems to ignore lower bound of interval


From: Davide Brini
Subject: Re: Case insensitivity seems to ignore lower bound of interval
Date: Thu, 28 Apr 2011 11:43:52 +0100

On Thu, 28 Apr 2011 12:21:37 +0200
Eric Bischoff <address@hidden> wrote:

> On Thursday 28 April 2011 12:00:33, you wrote:
> > You seem to think this is gawk-specific, but in fact any locale-aware
> > tool that uses regular expressions behaves the same (try eg with sed or
> > grep).
> 
> Not here:
> 
> $ echo 'ijklmnopqrstuvwxyz'| sed 's/[r-z]/X/g'
> ijklmnopqXXXXXXXXX
> $ echo 'ijklmnopqrstuvwxyz'| sed 's/[R-Z]/X/g'
> ijklmnopqrstuvwxyz

This is strange, since with GNU sed 4.2.1 I get

$ echo 'ijklmnopqrstuvwxyz'| sed 's/[R-Z]/X/g'
ijklmnopqrXXXXXXXX

And this shows the converse:

$ echo 'IJKLMNOPQRSTUVWXYZ' | sed 's/[r-z]/_/g'
IJKLMNOPQ________Z

("[r-z]" does not include "Z" because the collating sequence interleaves
the two cases, ... vVwWxXyYzZ, so "Z" sorts after "z" and falls outside the
range.)

I get the above results with all the UTF-8 locales I can try on this box
(admittedly, not too many).
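
To make the interleaving concrete, here is a minimal sketch in plain shell.
It assumes a simplified collating sequence in which every lowercase letter is
immediately followed by its uppercase counterpart; the "order" string and the
"expand_range" helper are made up for illustration, and real locale tables
are more involved than this.

```shell
# Assumed dictionary-style collating sequence (aAbB...zZ). This is only an
# illustration; it is not read from any actual locale definition.
order=aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

# expand_range LO HI: print the characters of $order from LO through HI,
# i.e. what a range "[LO-HI]" would cover under the assumed sequence.
expand_range() {
    rest=${order#*"$1"}       # everything after LO
    after=${rest#*"$2"}       # everything after HI
    body=${rest%"$after"}     # the characters after LO, up to and including HI
    printf '%s%s\n' "$1" "$body"
}

expand_range r z    # prints: rRsStTuUvVwWxXyYz   (no "Z")
expand_range R Z    # prints: RsStTuUvVwWxXyYzZ   (no "r")
```

Under that assumption the two expansions line up with the sed results above:
"[R-Z]" leaves "r" alone, and "[r-z]" leaves "Z" alone.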

> $ echo 'ijklmnopqrstuvwxyz'| awk '{gsub("[r-z]", "X"); print}'
> ijklmnopqXXXXXXXXX
> $ echo 'ijklmnopqrstuvwxyz'| awk '{gsub("[R-Z]", "X"); print}'
> ijklmnopqrXXXXXXXX

Same here.
 
> $ echo 'ijklmnopqr'| grep "[r-z]"
> ijklmnopqr
> $ echo 'ijklmnopqr'| grep "[R-Z]"

This is expected, since "[R-Z]" does NOT match "r" in the locales under
discussion. But nonetheless, you are right that grep seems to behave
differently, although I was almost sure I had seen it show the same
behavior at some point; I may be misremembering.

Furthermore, grep's behavior here directly contradicts its own documentation,
which states

"Within a bracket expression, a "range expression" consists of two
characters separated by a hyphen.  It matches any single character that
sorts between the two characters, inclusive, using the locale's
collating sequence and character set.  For example, in the default C
locale, `[a-d]' is equivalent to `[abcd]'.  Many locales sort
characters in dictionary order, and in these locales `[a-d]' is
typically not equivalent to `[abcd]'; it might be equivalent to
`[aBbCcDd]', for example.  To obtain the traditional interpretation of
bracket expressions, you can use the `C' locale by setting the `LC_ALL'
environment variable to the value `C'".

So I would definitely expect grep to follow awk's and sed's behavior.
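
For what it's worth, the workaround the documentation mentions can be applied
per command, so the rest of the session keeps its locale. The following is
only a sketch that reuses the test strings from this thread; the last command
avoids ranges altogether by listing the characters explicitly, which does not
depend on the collating sequence.

```shell
# In the C locale ranges follow byte order, so uppercase and lowercase
# ranges do not overlap.
echo 'ijklmnopqrstuvwxyz' | LC_ALL=C sed 's/[R-Z]/X/g'
# prints: ijklmnopqrstuvwxyz

echo 'IJKLMNOPQRSTUVWXYZ' | LC_ALL=C sed 's/[r-z]/_/g'
# prints: IJKLMNOPQRSTUVWXYZ

# An explicit list instead of a range means the same thing in any locale.
echo 'ijklmnopqrstuvwxyz' | sed 's/[RSTUVWXYZ]/X/g'
# prints: ijklmnopqrstuvwxyz
```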


As further food for thought, "sort" is another command whose collating order
is affected by the locale ("consistently", so to speak, with awk and sed).

$ printf '%s\n' A b Z | sort
A
b
Z

$ printf '%s\n' A b Z | LC_ALL=C sort
A
Z
b

-- 
D.


