Re: Case insensitivity seems to ignore lower bound of interval

From: Davide Brini
Subject: Re: Case insensitivity seems to ignore lower bound of interval
Date: Tue, 26 Apr 2011 18:49:27 +0100

On Tue, 26 Apr 2011 17:27:49 +0200 Eric Bischoff <address@hidden> wrote:

> Hi all,
> $ echo "ijklmnopqrstuvwxyz" | awk '{ gsub(/[R-Z]/, "X"); print }'
> ijklmnopqrXXXXXXXX
> please notice that "r" is not matched, i.e. case insensitivity is applied
> only to [S-Z] interval.
> $ awk --version
> GNU Awk 3.1.7
> (...)
> $ echo $LANG
> fr_FR.UTF-8
> The problem does not appear when locale is C.
> The problem does not appear when interval is specified as [r-z] (lower
> case).
> This contradicts http://www.gnu.org/software/gawk/manual/gawk.html#Locales
> which documents 
>      $ echo something1234abc | gawk '{ sub("[A-Z]*$", ""); print }'
> as returning
>      something1234
> while it returns
>      something1234a
> Bug reproduced both on Ubuntu Natty beta 2 and on Fedora 15.

This is not a bug but expected behavior (not that I agree, but that's the
way it is).

The executive summary is that many non-C locales have different collation
orders (mostly dictionary order, regardless of case). In those locales,
an expression like [R-Z] may match (at least) "RsStTuUvVwWxXyYzZ", plus
perhaps any other character that sorts between them (note that the above
does not include "r"). Similarly for other range expressions.
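To make the collation point concrete, here is a small sketch, assuming a
POSIX shell with sort and tr available (the variable names are only for
illustration). In the C locale all uppercase letters sort before all
lowercase ones, so "R" through "Z" form a contiguous uppercase-only block.
The second command runs under whatever locale is active, and its output
depends on that locale, so no result is claimed for it:

```sh
#!/bin/sh
# A handful of letters, one per line, around the R..Z range.
letters=$(printf '%s\n' r R s S z Z)

# C locale: plain byte order, all uppercase before all lowercase.
c_order=$(printf '%s\n' "$letters" | LC_ALL=C sort | tr -d '\n')
printf 'C locale order: %s\n' "$c_order"    # RSZrsz

# Ambient locale: many non-C locales interleave the two cases instead,
# which is why a range starting at "R" can pick up lowercase letters.
printf 'ambient order:  '
printf '%s\n' "$letters" | sort | tr -d '\n'
echo
```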

To work around this, either set LC_ALL=C to get plain ASCII ordering, use
[[:upper:]], [[:lower:]], etc. as appropriate, or, if you need a partial
range, spell it out explicitly, e.g. [RSTUVWXYZ].
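The three workarounds can be sketched as follows, assuming a POSIX shell and
a POSIX-conformant awk on the PATH. The mixed-case sample string and the
variable names are made up for illustration; they are not from the original
report:

```sh
#!/bin/sh
# Mixed-case sample so that a wrongly matched lowercase letter is visible.
input='qrsQRS'

# 1. Force the C locale for this one awk call: [R-Z] is then ASCII R..Z only.
c_locale=$(printf '%s\n' "$input" | LC_ALL=C awk '{ gsub(/[R-Z]/, "X"); print }')

# 2. POSIX character class: matches every uppercase letter (so "Q" as well).
upper_class=$(printf '%s\n' "$input" | awk '{ gsub(/[[:upper:]]/, "X"); print }')

# 3. Explicit list: there is no range, so collation order cannot matter.
explicit=$(printf '%s\n' "$input" | awk '{ gsub(/[RSTUVWXYZ]/, "X"); print }')

printf 'LC_ALL=C [R-Z]: %s\n' "$c_locale"    # qrsQXX
printf '[[:upper:]]:    %s\n' "$upper_class"
printf '[RSTUVWXYZ]:    %s\n' "$explicit"    # qrsQXX
```

Note that LC_ALL=C is set only for the single awk invocation here, so the
rest of the script keeps its normal locale. Setting it globally would also
affect how non-ASCII input is handled.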

