Re: Case insensitivity seems to ignore lower bound of interval
From: Eric Bischoff
Subject: Re: Case insensitivity seems to ignore lower bound of interval
Date: Thu, 28 Apr 2011 11:54:15 +0200
User-agent: KMail/1.13.6 (Linux/2.6.38-8-generic; KDE/4.6.2; x86_64; ; )
On Thursday 28 April 2011 10:28:01, address@hidden wrote:
> > Then it's using strcmp() that is plain wrong :-(.
>
> Gawk does not use strcmp() for regex matching. (You may not have been
> saying that it did, I admit.)
>
> The issue is indeed as described in the previous mails, and the
> development version of the gawk doc explains these issues considerably
> better.
>
> I recommend checking out a copy from the git repo on savannah.gnu.org
> and reviewing the doc; I will welcome feedback on it!
Although it is nice to have accurate, comprehensive and up-to-date
documentation, I don't think this is a documentation issue. Sorry if I sound
stubborn, but I still think this is simply a bug ;-).
At this point, my personal opinion is that all intervals are simply unusable
with gawk, as they give results that are both unpredictable and
counterintuitive with any locale other than "C".
Asking every user to run "LANG=C gawk" is not a solution.
Asking non-English users to write explicit [abcdefghijklmnopqrstuvwxyz]
choices is not a solution either.
Collation would be really useful to non-English users ("Métier" = "MÉTIER" =
"METIER"). But if that is too complicated, it would also be okay to have no
collation at all, as long as intervals are correctly defined (i.e. [R-Z] =
[RSTUVWXYZ] and [r-z] = [rstuvwxyz]).
My point is that [R-Z] should be defined either as [rRsStTuUvVwWxXyYzZ] or as
[RSTUVWXYZ], but not as something surprising like [RsStTuUvVwWxXyYzZ] (no "r").
The current situation, where [R-Z] catches "t" but does not catch "r", is really
weird, even if it is compatible with the freedom offered by the POSIX standard.
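To make the surprising set concrete, here is a minimal Python sketch (not gawk code). It assumes a hypothetical collation sequence in which each lowercase letter sorts immediately before its uppercase form (a A b B ... z Z); real locale tables differ in detail, and the function name `collation_range` is made up for illustration. Taking the slice of that sequence from "R" to "Z" yields the set described above.

```python
import string

# ASSUMPTION: a simplified collation order "aAbBcC...zZ". Real locale
# collation tables are more elaborate; this only illustrates the effect.
COLLATION = [
    ch
    for pair in zip(string.ascii_lowercase, string.ascii_uppercase)
    for ch in pair
]

def collation_range(lo, hi, order=COLLATION):
    """Return every character from lo to hi (inclusive) in collation order."""
    start, end = order.index(lo), order.index(hi)
    return "".join(order[start:end + 1])

print(collation_range("R", "Z"))  # RsStTuUvVwWxXyYzZ
```

Under this assumed ordering, "r" sorts just before "R" and so falls outside the interval, while "t" sorts between "S" and "T" and falls inside it.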
One technical possibility would be to simply use Unicode code points. The code
for "A" is 0041, the code for "a" is 0061, the code for "à" is 00E0, and the
code for the angstrom sign ("Å") is 212B. For people using 7-bit or 8-bit
locales (ASCII, ISO Latin-1, etc.), there is always a conversion to Unicode
available that could be applied up front. That would not help with collation,
but at least no lowercase letter would be part of [R-Z].
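The code point idea can be sketched in a few lines of Python. This is only an illustration of the proposal, not how gawk works; the helper names `to_unicode` and `in_interval` are hypothetical. Text in a legacy 8-bit encoding is decoded to Unicode first, and interval membership is then decided purely by comparing code points.

```python
import re

def to_unicode(raw, encoding):
    """Decode bytes from a legacy 7-bit or 8-bit encoding into Unicode text."""
    return raw.decode(encoding)

def in_interval(ch, lo, hi):
    """True if ch lies between lo and hi (inclusive) by Unicode code point."""
    return ord(lo) <= ord(ch) <= ord(hi)

# The code points quoted in the message.
for ch in ["A", "a", "\u00e0", "\u212b"]:
    print(f"U+{ord(ch):04X}")  # U+0041, U+0061, U+00E0, U+212B

# Byte 0xE0 in ISO Latin-1 converts to U+00E0 ("à").
a_grave = to_unicode(b"\xe0", "latin-1")

print(in_interval("R", "R", "Z"))  # True
print(in_interval("r", "R", "Z"))  # False
print(in_interval("t", "R", "Z"))  # False

# Python's own re module compares ranges by code point, so it behaves the same.
print(re.fullmatch("[R-Z]", "t") is None)  # True
```

With this definition [R-Z] is exactly [RSTUVWXYZ] whatever the locale, at the cost of giving up any locale-aware collation.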
I fully realize that awk was born in the seventies, at a time when no one was
even thinking about characters of more than one byte, but the future is
definitely on the side of multibyte characters. That is why I think that a
solution to this precise problem could rely on Unicode code points. Of course,
I am not saying that this is the only solution, or even the best one.
--
Éric Bischoff - Bureau Cornavin
Technical writing and translations
http://www.bureau-cornavin.com
(+33) 3 68 46 00 85
sip:address@hidden