Re: Case insensitivity seems to ignore lower bound of interval

bug-gnu-utils

From:	Davide Brini
Subject:	Re: Case insensitivity seems to ignore lower bound of interval
Date:	Thu, 28 Apr 2011 11:43:52 +0100

On Thu, 28 Apr 2011 12:21:37 +0200
Eric Bischoff <address@hidden> wrote:

> On Thursday 28 April 2011 12:00:33, you wrote:
> > You seem to think this is gawk-specific, but in fact any locale-aware
> > tool that uses regular expressions behaves the same (try eg with sed or
> > grep).
> 
> Not here:
> 
> $ echo 'ijklmnopqrstuvwxyz'| sed 's/[r-z]/X/g'
> ijklmnopqXXXXXXXXX
> $ echo 'ijklmnopqrstuvwxyz'| sed 's/[R-Z]/X/g'
> ijklmnopqrstuvwxyz

This is strange, since with GNU sed 4.2.1 I get

$ echo 'ijklmnopqrstuvwxyz'| sed 's/[R-Z]/X/g'
ijklmnopqrXXXXXXXX

And this shows the converse:

$ echo 'IJKLMNOPQRSTUVWXYZ' | sed 's/[r-z]/_/g'
IJKLMNOPQ________Z

("[r-z]" does not include "Z" because the collating sequence runs
... vVwWxXyYzZ, so "Z" sorts after "z" and falls outside the range.)
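
To make that ordering concrete, here is a rough sketch that models such a collating sequence as the plain string aAbB...zZ and lists what falls between two endpoints. The interleaved order and the range_members helper are assumptions made up for illustration; real locales define collation in far more detail, and this is not how sed or awk compute ranges internally.

```shell
# Assumed model of a dictionary-order collating sequence: aAbBcC...zZ
order=$(LC_ALL=C awk 'BEGIN { for (i = 97; i <= 122; i++) printf "%c%c", i, i - 32 }')

# range_members LO HI: print every letter from LO to HI in the model order.
# (Hypothetical helper, not part of any tool.)
range_members() {
    LC_ALL=C awk -v s="$order" -v lo="$1" -v hi="$2" 'BEGIN {
        a = index(s, lo)              # position of the lower bound
        b = index(s, hi)              # position of the upper bound
        print substr(s, a, b - a + 1)
    }'
}

range_members r z   # rRsStTuUvVwWxXyYz  ("Z" is missing)
range_members R Z   # RsStTuUvVwWxXyYzZ  ("r" is missing)
```

Under this model each range drops exactly one letter at one end, which matches the sed output above.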

I get the above results with all the UTF-8 locales I can try on this box
(admittedly, not too many).

> $ echo 'ijklmnopqrstuvwxyz'| awk '{gsub("[r-z]", "X"); print}'
> ijklmnopqXXXXXXXXX
> $ echo 'ijklmnopqrstuvwxyz'| awk '{gsub("[R-Z]", "X"); print}'
> ijklmnopqrXXXXXXXX

Same here.

> $ echo 'ijklmnopqr'| grep "[r-z]"
> ijklmnopqr
> $ echo 'ijklmnopqr'| grep "[R-Z]"

This is expected, since "[R-Z]" does NOT match "r" in the locales under
discussion. But nonetheless, you are right that grep seems to behave
differently, although I was almost sure I had seen it show the same
behavior at some point; I may be misremembering.

Furthermore, the above is in direct contradiction to grep's documentation,
which states

"Within a bracket expression, a "range expression" consists of two
characters separated by a hyphen.  It matches any single character that
sorts between the two characters, inclusive, using the locale's
collating sequence and character set.  For example, in the default C
locale, `[a-d]' is equivalent to `[abcd]'.  Many locales sort
characters in dictionary order, and in these locales `[a-d]' is
typically not equivalent to `[abcd]'; it might be equivalent to
`[aBbCcDd]', for example.  To obtain the traditional interpretation of
bracket expressions, you can use the `C' locale by setting the `LC_ALL'
environment variable to the value `C'".

So I would definitely expect grep to follow awk's and sed's behavior.
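
For what it's worth, the workaround the documentation mentions can be sketched as follows. Forcing the C locale for a single command makes ranges follow byte values again, and a POSIX character class such as [[:upper:]] avoids relying on range ordering at all. The function names are made up for the example.

```shell
# Hypothetical helpers; the names are not part of any tool.

# Run sed in the C locale so that [R-Z] means exactly the bytes R..Z.
c_range() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[R-Z]/X/g'
}

# A character class states the intent directly, without a range,
# so no locale override is needed for plain ASCII letters.
upper_class() {
    printf '%s\n' "$1" | sed 's/[[:upper:]]/X/g'
}

c_range 'ijklmnopqrstuvwxyz'    # ijklmnopqrstuvwxyz (nothing matches)
c_range 'IJKLMNOPQRSTUVWXYZ'    # IJKLMNOPQXXXXXXXXX
upper_class 'abcXYZ'            # abcXXX
```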

As a further data point, "sort" is another command whose ordering is
affected by the locale ("consistently", so to speak, with awk and sed):

$ printf '%s\n' A b Z | sort
A
b
Z

$ printf '%s\n' A b Z | LC_ALL=C sort
A
Z
b

-- 
D.
