Re: Case insensitivity seems to ignore lower bound of interval

bug-gnu-utils

From:	Aharon Robbins
Subject:	Re: Case insensitivity seems to ignore lower bound of interval
Date:	Fri, 29 Apr 2011 10:55:23 +0300
User-agent:	Heirloom mailx 12.4 7/29/08

Eric,

Hi.

> At this point, my personal opinion is that all intervals are simply
> unusable with gawk, as they give results that are both unpredictable and
> counter-intuitive with any locale other than "C".

Indeed - this is why the (development) documentation clearly explains
not to use them, and why the development code converts [R-Z] into
[RSTUVWXYZ].  I have been fighting this issue for years now.
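
For illustration only: a minimal sketch, in Python rather than gawk's C,
of what rewriting [R-Z] as [RSTUVWXYZ] amounts to. It assumes the range
is expanded by character code, which only makes sense for single-byte
ASCII endpoints; expand_range is a hypothetical name, not a gawk function.

```python
def expand_range(lo: str, hi: str) -> str:
    """Expand a bracket range such as R-Z into an explicit character list.

    Assumption: endpoints are ordered by character code, not by the
    locale's collation sequence.
    """
    if ord(lo) > ord(hi):
        raise ValueError(f"invalid range {lo}-{hi}")
    return "".join(chr(code) for code in range(ord(lo), ord(hi) + 1))


print("[" + expand_range("R", "Z") + "]")  # [RSTUVWXYZ]
```

Once the members are listed explicitly, the locale's ordering no longer
has any say in what the bracket expression matches.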

Davide Brini states:

> You seem to think this is gawk-specific, but in fact any locale-aware tool
> that uses regular expressions behaves the same (try eg with sed or grep).

And this too is correct.  POSIX locales (in my not-so-humble opinion) are
a total and utter botch.

(I'll point out also that all of this happens down in the library routines
that gawk uses, and which are (complicated, messy) black boxes as far as
I'm concerned.)

> Asking non-English users to write explicit [abcdefghijklmnopqrstuvwxyz] 
> choices is not a solution either.

[[:lower:]], [[:upper:]] and so on exist to mitigate this issue. They are
not perfect solutions.

> Collation [...]

Collation has to do with sorting order, and less so with regular expression
matching.  Gawk doesn't support [[=e=]] which is supposed to match all
versions of the letter 'e'.

> My point is that [R-Z] should either be defined to [rRsStTuUvVwWxXyYzZ]
> or to [RSTUVWXYZ], but not to a surprising thing like [RsStTuUvVwWxXyYzZ]
> (no "r").  The current situation where [R-Z] catches "t" but does not
> catch "r" is really weird, even if it's compatible with the freedom
> offered by the POSIX standard.

I agree, which is why I've clarified the doc and changed the code, but again,
this is not a gawk-specific issue but a general locale issue.
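
A small Python sketch of how the [RsStTuUvVwWxXyYzZ] result can arise.
It assumes a made-up collation sequence aAbBcC...zZ (real locales differ
in detail) and treats a range as "everything between the two endpoints
in that sequence". The names COLLATION and range_by_collation are
hypothetical.

```python
import string

# Assumed dictionary-style collation order: aAbBcC...zZ
COLLATION = "".join(
    lower + upper
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def range_by_collation(lo: str, hi: str, order: str = COLLATION) -> str:
    """Return every character from lo to hi in the given collation order."""
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]


members = range_by_collation("R", "Z")
print(members)                          # RsStTuUvVwWxXyYzZ
print("t" in members, "r" in members)   # True False
```

Because "r" sorts just before "R" in this ordering, it falls outside the
range, while "t" sorts between "R" and "Z" and is included.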

> One technical possibility would be to simply use Unicode code positions.

Unfortunately, no.  Gawk is used in many parts of the world where Unicode
is not the standard character set (Japan, China, etc.) and restricting
gawk to just Unicode would not be a good idea.  You can today use
octal escapes inside [...] if you want. (It's even documented! :-).
But that's only good for single bytes.
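
To see why a single octal escape only covers single bytes, here is a
short Python sketch. It assumes UTF-8 as the encoding for the multibyte
case; octal_escapes is a hypothetical helper, not part of gawk.

```python
def octal_escapes(ch: str, encoding: str = "utf-8") -> list[str]:
    """Return one backslash-octal escape per byte of the encoded character."""
    return [f"\\{byte:03o}" for byte in ch.encode(encoding)]


print(" ".join(octal_escapes("R")))   # \122
print(" ".join(octal_escapes("é")))   # \303 \251
```

"R" fits in one byte and so in one escape; "é" needs two bytes in UTF-8,
so no single octal escape can stand for it.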

Maybe in another 10 years it'll be safe to move exclusively to Unicode.

To sum up, it's a thorny issue, of which I'm well aware, but there is
no simple, easy solution.

If you still disagree, then I'm sorry, there's nothing else I can do
to help.

Thanks,

Arnold
