bug-gnu-utils

Re: Case insensitivity seems to ignore lower bound of interval


From: Eric Bischoff
Subject: Re: Case insensitivity seems to ignore lower bound of interval
Date: Thu, 28 Apr 2011 11:54:15 +0200
User-agent: KMail/1.13.6 (Linux/2.6.38-8-generic; KDE/4.6.2; x86_64; ; )

On Thursday 28 April 2011 10:28:01, address@hidden wrote:
> > Then it's using strcmp() that is plain wrong :-(.
> 
> Gawk does not use strcmp() for regex matching. (You may not have been
> saying that it did, I admit.)
> 
> The issue is indeed as described in the previous mails, and the
> development version of the gawk doc explains these issues considerably
> better.
>
> I recommend checking out a copy from the git repo on savannah.gnu.org
> and reviewing the doc; I will welcome feedback on it!

Although it is nice to have accurate, comprehensive and up-to-date 
documentation, I don't think this is a documentation issue. Sorry if I sound 
stubborn, but I still think this is simply a bug ;-).

At this point, my personal opinion is that all intervals are simply unusable
with gawk, as they give results that are both unpredictable and
counterintuitive in any locale other than "C".

Asking every user to run "LANG=C gawk" is not a solution.

Asking non-English users to write explicit [abcdefghijklmnopqrstuvwxyz] 
choices is not a solution either.

Collation would be really useful to non-English users ("Métier" = "MÉTIER" =
"METIER"). But if that is too complicated, it would also be okay to have no
collation at all, as long as intervals are correctly defined (i.e. [R-Z] =
[RSTUVWXYZ] and [r-z] = [rstuvwxyz]).
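
As a rough illustration of the kind of equivalence meant by "Métier" = "MÉTIER" = "METIER", here is a minimal Python sketch. It is not how gawk or any locale library implements collation; the helper name `fold` is hypothetical, and stripping accents through Unicode decomposition is only an approximation of real collation rules.

```python
import unicodedata

def fold(text):
    """Hypothetical helper: reduce text to an accent-free, case-free key."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # casefold() removes case distinctions.
    return stripped.casefold()

words = ["M\u00e9tier", "M\u00c9TIER", "METIER"]  # Métier, MÉTIER, METIER
keys = {fold(w) for w in words}
print(keys)
```

All three spellings reduce to the same key under this approximation, which is the behaviour a collation-aware comparison would give.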

My point is that [R-Z] should be defined either as [rRsStTuUvVwWxXyYzZ] or as
[RSTUVWXYZ], but not as something surprising like [RsStTuUvVwWxXyYzZ] (no "r").
The current situation, where [R-Z] matches "t" but does not match "r", is really
weird, even if it's compatible with the freedom offered by the POSIX standard.
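
The surprising expansion can be reproduced with a small Python sketch. It assumes a collation order in which each lowercase letter sorts immediately before its uppercase counterpart (a A b B ... z Z); that order and the helper `expand_interval` are assumptions made for illustration, not gawk's actual code.

```python
import string

# Assumed collation order: a A b B c C ... z Z
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)

def expand_interval(start, end, order=COLLATION):
    """Hypothetical helper: characters from start to end in collation order."""
    i, j = order.index(start), order.index(end)
    return order[i:j + 1]

collated = expand_interval("R", "Z")
print(collated)         # RsStTuUvVwWxXyYzZ
print("t" in collated)  # True
print("r" in collated)  # False
```

Because "r" sorts just before "R" in this order, it falls outside the interval, while every lowercase letter from "s" onward falls inside it.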

One technical possibility would be to simply use Unicode code points. The code
for "A" is 0041, the code for "a" is 0061, the code for "à" is 00E0, and the
code for the angstrom sign ("Å") is 212B. For people using 7-bit or 8-bit
locales (ASCII, ISO Latin-1, etc.), there is always a conversion to Unicode
available that could be done upfront. That would not help with collation, but
at least no lowercase letter would be part of [R-Z].
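
A minimal Python sketch of the code-point idea follows. The helper `code_point_interval` is hypothetical and only shows what an interval would contain if it were defined by Unicode code points; it says nothing about how gawk would have to implement it.

```python
def code_point_interval(start, end):
    """Hypothetical helper: characters whose code points lie in [start, end]."""
    return "".join(chr(c) for c in range(ord(start), ord(end) + 1))

# Code points mentioned above: A, a, a-grave, angstrom sign.
for ch in ("A", "a", "\u00e0", "\u212b"):
    print(f"U+{ord(ch):04X}")

# An 8-bit ISO Latin-1 byte converts to its Unicode character upfront.
converted = b"\xe0".decode("latin-1")

upper = code_point_interval("R", "Z")
lower = code_point_interval("r", "z")
print(upper)  # RSTUVWXYZ
print(lower)  # rstuvwxyz
```

With this definition the uppercase interval contains no lowercase letter, at the cost of ignoring collation entirely.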

I fully realize that awk was born in the seventies, at a time when no one was
even thinking about characters of more than one byte, but the future is
definitely on the side of multibyte characters. That's the reason why I think
that a solution to this precise problem could rely on Unicode code points. Of
course I am not saying that this is the only solution, nor even the best one.


-- 
Éric Bischoff - Bureau Cornavin
Technical writing and translations
http://www.bureau-cornavin.com
(+33) 3 68 46 00 85
sip:address@hidden


