Re: Case insensitivity seems to ignore lower bound of interval
From: Davide Brini
Subject: Re: Case insensitivity seems to ignore lower bound of interval
Date: Thu, 28 Apr 2011 11:43:52 +0100

On Thu, 28 Apr 2011 12:21:37 +0200
Eric Bischoff <address@hidden> wrote:
> On Thursday 28 April 2011 12:00:33, you wrote:
> > You seem to think this is gawk-specific, but in fact any locale-aware
> > tool that uses regular expressions behaves the same (try e.g. with sed
> > or grep).
>
> Not here:
>
> $ echo 'ijklmnopqrstuvwxyz'| sed 's/[r-z]/X/g'
> ijklmnopqXXXXXXXXX
> $ echo 'ijklmnopqrstuvwxyz'| sed 's/[R-Z]/X/g'
> ijklmnopqrstuvwxyz

This is strange, since with GNU sed 4.2.1 I get:

$ echo 'ijklmnopqrstuvwxyz'| sed 's/[R-Z]/X/g'
ijklmnopqrXXXXXXXX

And this shows the converse:

$ echo 'IJKLMNOPQRSTUVWXYZ' | sed 's/[r-z]/_/g'
IJKLMNOPQ________Z

("[r-z]" does not include "Z", because the collating order is
... vVwWxXyYzZ, so "Z" sorts after "z".)

I get the above results with all the UTF-8 locales I can try on this box
(admittedly, not too many).

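For comparison, here is a minimal sketch of the same two sed commands with
the locale forced to C for that one command. It assumes only a POSIX shell
and a sed on the PATH; the variable names are just for illustration. In the
C locale a range follows byte order, so neither range crosses case:

```shell
# Force the C locale for sed only; the rest of the shell session is untouched.
lower_in=$(echo 'ijklmnopqrstuvwxyz' | LC_ALL=C sed 's/[R-Z]/X/g')
upper_in=$(echo 'IJKLMNOPQRSTUVWXYZ' | LC_ALL=C sed 's/[r-z]/_/g')

echo "$lower_in"   # ijklmnopqrstuvwxyz (no lowercase letter falls in [R-Z])
echo "$upper_in"   # IJKLMNOPQRSTUVWXYZ (no uppercase letter falls in [r-z])
```

Setting LC_ALL on the command itself, rather than exporting it, keeps the
change local to the one tool whose ranges you care about.
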
> $ echo 'ijklmnopqrstuvwxyz'| awk '{gsub("[r-z]", "X"); print}'
> ijklmnopqXXXXXXXXX
> $ echo 'ijklmnopqrstuvwxyz'| awk '{gsub("[R-Z]", "X"); print}'
> ijklmnopqrXXXXXXXX
Same here.
> $ echo 'ijklmnopqr'| grep "[r-z]"
> ijklmnopqr
> $ echo 'ijklmnopqr'| grep "[R-Z]"

This is expected, since "[R-Z]" does NOT match "r" in the locales under
discussion. But nonetheless, you are right that grep seems to behave
differently from sed and awk, although I was almost sure I had seen it
show the same behavior at some point; I may be misremembering.

Furthermore, grep's behavior here directly contradicts its own
documentation, which states:

"Within a bracket expression, a "range expression" consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, inclusive, using the locale's
collating sequence and character set. For example, in the default C
locale, `[a-d]' is equivalent to `[abcd]'. Many locales sort
characters in dictionary order, and in these locales `[a-d]' is
typically not equivalent to `[abcd]'; it might be equivalent to
`[aBbCcDd]', for example. To obtain the traditional interpretation of
bracket expressions, you can use the `C' locale by setting the `LC_ALL'
environment variable to the value `C'".
So I would definitely expect grep to follow awk's and sed's behavior.
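
As a sketch of the workaround the documentation describes, plus one
alternative it does not mention in the passage above (POSIX character
classes, which avoid ranges entirely), something like the following should
give the same answer whatever the locale. The variable names are made up
for the example:

```shell
s='ijklmnopqr'

# In the C locale a range is a run of byte values, so [R-Z] is uppercase only.
if echo "$s" | LC_ALL=C grep -q '[R-Z]'; then c_range='match'; else c_range='no match'; fi

# POSIX character classes name the set directly instead of relying on a range.
if echo "$s" | grep -q '[[:upper:]]'; then upper_class='match'; else upper_class='no match'; fi
if echo "$s" | grep -q '[[:lower:]]'; then lower_class='match'; else lower_class='no match'; fi

echo "C-locale [R-Z]: $c_range"      # no match
echo "[[:upper:]]:    $upper_class"  # no match
echo "[[:lower:]]:    $lower_class"  # match
```

Note that "[[:upper:]]" covers all uppercase letters, not just R to Z, so it
is a substitute for "[A-Z]" rather than for an arbitrary sub-range.
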

For further thought, "sort" is another command whose collating order is
affected by the locale ("consistently", so to speak, with awk and sed):

$ printf '%s\n' A b Z | sort
A
b
Z
$ printf '%s\n' A b Z | LC_ALL=C sort
A
Z
b
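
If what you actually want is the case-insensitive ordering without
depending on the locale, one option (a sketch, not the only way) is to keep
the C locale and ask sort to fold case with -f. The variable name is just
for illustration:

```shell
# C locale for byte-order comparison, -f to fold lowercase to uppercase first.
folded=$(printf '%s\n' A b Z | LC_ALL=C sort -f)
printf '%s\n' "$folded"
# A
# b
# Z
```
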
--
D.