bug#11621: questionable locale sorting order (especially as related to c
From: Pádraig Brady
Subject: bug#11621: questionable locale sorting order (especially as related to char ranges in REs)
Date: Mon, 04 Jun 2012 09:48:52 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0

On 06/04/2012 06:03 AM, Linda A. Walsh wrote:
>
>
> Pádraig Brady wrote:
>> On 06/03/2012 11:13 PM, Linda Walsh wrote:
>>> Within the past few years, use of ranges in RE's has become
>>> unreliable due to some locale changes sorting their native character
>>> sets such that a<A<b<B<y<Y<z<Z (vs. 'C' ordering A<B<Y<Z<a<b<y<z).
>>>
>>> There seems to be a problem when a user has set their system to use
>>> Unicode: it is no longer using the locale-specific character set
>>> (iso-8859-x, or others).
> ----
> To clarify my above statement:
>
>
> There seems to be a problem when a user has set their system to use
> Unicode: It is no longer using the locale-specific character set (iso-8859-x,
> or others) -- ***or*** *their* *orderings*. I.e. Unicode defines a collation
> order -- I don't know that the others do ('C' does, but I don't know about
> other locale-specific character sets).
>
>
>> It's not specific to "unicode". Sorting in an iso-8859-1 charset
>> results in locale ordering:
> ----
> Can you cite a source specifying the sort/collation order of the
> iso-8859-1 charset that would prove that it is non-conforming to the
> collation specification for that charset?
>
> I.e. if there is no official source, then the order with that charset
> is "undefined", and while it may not be desirable, returning a<A<b<B would
> not be "an error".
It's a charset. Of course the order is defined. Try: man iso-8859-1
The relative ordering can be trivially inferred from the command I presented.
But to be explicit:
$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=en_US sort | iconv -f iso-8859-1
a
A
á
b
$ printf "%s\n" A b a á | iconv -t iso-8859-1 | LC_ALL=C sort | iconv -f iso-8859-1
A
a
b
á
>
>
>
>
>>> http://unicode.org/charts/case/chart_Latin.htm.
>>
>> http://unicode.org/charts/case/chart_Latin.html
> ---
> ^^Correct^^ (typho)
>
>>> Temporarily ignoring accents, only talking about lower and upper
>>> case letters, ...
>>
>> Well case comparison is a complicated area.
> ----
> A bit, but it's mostly just wrong in the gnu library concerning unicode,
> and, as you are pointing out -- the 'C' encoding as well.
> The 'C' locale was the original charset used by the 'C' language -- only 8
> bits wide.
>
> So how can it sort characters beyond the lower 256?
> This would seem to be meaningless and buggy output.
http://www.pixelbeat.org/docs/utf8_programming.html
> Is it?... When the case comparison ordering is specified in a
> standard, it makes it fairly clear that one is either compliant with the
> standard or not.
>
> In this case, the Gnu sort/collation lib is not Unicode/UTF-8 compliant.
>
> What happens in other charsets may or may not be covered under some
> other standard -- e.g. the 'C'/ascii ordering is specified. But I don't know
> if others have relevant standards or not.
>
>>
>> For the special case of discounting accented chars etc.
>> you can use an attribute of the well designed UTF-8.
> ---
> This is not exactly the point -- the point is that the core sort
> DOESN'T use that ordering. That's the bug I am reporting.
Well you can't generally exclude accents.
>
> In reporting this, I'm trying to keep the argument 'simple' and focus on
> the problem of widely used ranges in the first 256 code-points of
> Unicode.
>
> Unicode gives a fairly extensive algorithm for handling accents,
> but I didn't want to complicate the discussion by "going there". Please
> focus this bug on the lower 128 code points, as full unicode compliance
> with the full collation algorithm that is specified is likely to be a
> larger task. HOWEVER, fixing the sorting/collation order of the lower
> 128 code points is, comparatively, a small task that conceivably could be
> fixed in the next release.
The lower 128 code points = ASCII. If your input data is ASCII, just use LC_ALL=C.
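For instance, a rough sketch of that approach. The helper names is_ascii and sort_ascii_safe are made up for illustration and are not part of coreutils. The idea is to delete every byte in the ASCII range and see whether anything is left, and to force byte-order sorting only when the data really is pure ASCII:

```shell
#!/bin/sh
# is_ascii FILE: succeed if FILE contains only bytes 0-127.
# Hypothetical helper for illustration; not a coreutils command.
is_ascii() {
    # Delete all ASCII bytes; any byte that survives is non-ASCII.
    [ $(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c) -eq 0 ]
}

# sort_ascii_safe FILE: byte-order sort when the data is pure ASCII,
# otherwise sort under whatever locale is currently in effect.
sort_ascii_safe() {
    if is_ascii "$1"; then
        LC_ALL=C sort "$1"
    else
        sort "$1"
    fi
}
```

Called as `sort_ascii_safe somefile`, this should give the traditional A<B<a<b ordering for ASCII input regardless of the user's locale settings.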
>> Enabling traditional byte comparison on (normalized) UTF-8 data
>> will result in data sorted in Unicode code point order:
>> A b a á => A a b á
>
> But you are missing the point (as well as raising an interesting
> 'feature'(?bug?)).
>
> How is it that 'C' collation collates characters that are outside the ascii
> range?
Well, whether C should be a "unicode" or "ascii" charset is a whole different
kettle of fish. I was just pointing out (as per the link above) that
UTF-8 is well designed, so that it works with many traditional single-byte
functions.
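To illustrate, a minimal sketch, assuming the input is valid UTF-8 (sort_by_codepoint is just a name picked for this example). Because UTF-8 preserves code point order under plain byte comparison, forcing the C locale is enough:

```shell
#!/bin/sh
# Sort lines by raw byte value. For valid UTF-8 input this is
# the same as Unicode code point order.
# sort_by_codepoint is a hypothetical helper name.
sort_by_codepoint() {
    LC_ALL=C sort
}

# A (U+41) < a (U+61) < b (U+62) < á (U+E1, encoded as bytes C3 A1)
printf '%s\n' A b a á | sort_by_codepoint
```

On a UTF-8 terminal that should print A, a, b, á in that order, independent of LANG or LC_COLLATE, since LC_ALL takes precedence over both.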
> I.e. -- you can't interpret input data as 'unicode' in the 'C' locale.
> So how does this work in the 'C' locale? AND more importantly -- it SHOULD
> work when charset is unicode (UTF-8)... and does not. Test prog:
> ---------------
> #!/bin/bash
> set -m
> # vals to test:
> declare -a vals=( A a B b X x Y y Z z Ⅷ Ⅴ Ⅲ Ⅰ Ⅿ Ⅽ ⅶ ⅼ ⅲ )
> COLLATE_ORDER=C
>
> function isatty {
> local fd=${1:-1} ;
> 0<&$fd tty -s
> }
>
> function ord {
> local nl="";
> isatty && nl="\n"
> printf "%d$nl" "'$1"
> }
>
> function background_print {
> readarray -t inp
> for ch in "${inp[@]}"; {
> printf "%s (U+%x)\n" "$ch" "$(ord "$ch")"
> }
> }
>
>
> printf "%s\n" "${vals[@]}" |
> LC_COLLATE=$COLLATE_ORDER sort |
> background_print
>
> ------------------------------------
>
> Note, that the above produces:
>
> /tmp/stest
> Ⅷ (U+2167)
> Ⅴ (U+2164)
> Ⅲ (U+2162)
> Ⅰ (U+2160)
> Ⅿ (U+216f)
> Ⅽ (U+216d)
> ⅶ (U+2176)
> ⅼ (U+217c)
> ⅲ (U+2172)
> a (U+61)
> A (U+41)
> b (U+62)
> B (U+42)
> x (U+78)
> X (U+58)
> y (U+79)
> Y (U+59)
> z (U+7a)
> Z (U+5a)
>
> NOT the output you showed...Seems there's a bug in the C collation order?
Note the C locale doesn't use collation tables; it's simple byte comparison.
Seems there may be a bug in your script?
Also ensure that LC_ALL is not set, which will override LC_COLLATE.
$ printf "%s\n" A a B b 2 1 Ⅷ ⅶ ⅲ | LC_COLLATE=C sort
1
2
A
B
a
b
Ⅷ
ⅲ
ⅶ
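
To rule out the LC_ALL override mentioned above, something like the following could be run before the sort. It's only a sketch: effective_collate is a hypothetical helper that mirrors the POSIX precedence (LC_ALL, then LC_COLLATE, then LANG), and the fallback to "C" when nothing is set is an assumption, since that default is implementation-defined. The authoritative answer is whatever `locale` reports on your system.

```shell
#!/bin/sh
# Print the locale name that governs collation, following the POSIX
# precedence LC_ALL > LC_COLLATE > LANG.
# effective_collate is a hypothetical helper for illustration.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: the implementation-defined default is taken to be C.
        printf '%s\n' C
    fi
}

echo "collation locale in effect: $(effective_collate)"
```

If this reports something other than what was assigned to LC_COLLATE, LC_ALL is set in the environment and is winning.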
>
> Changing collation order to UTF-8:
>
> Same thing:
> /tmp/stest
> Ⅷ (U+2167)
> Ⅴ (U+2164)
> Ⅲ (U+2162)
> Ⅰ (U+2160)
> Ⅿ (U+216f)
> Ⅽ (U+216d)
> ⅶ (U+2176)
> ⅼ (U+217c)
> ⅲ (U+2172)
> a (U+61)
> A (U+41)
> b (U+62)
> B (U+42)
> x (U+78)
> X (U+58)
> y (U+79)
> Y (U+59)
> z (U+7a)
> Z (U+5a)
>
>
>>> I would assert this is a serious bug that should be addressed ASAP...
>>
>> As for the question in the subject for handling ranges in REs,
>> there has been recent work in changing as you suggest:
>>
>> http://lists.gnu.org/archive/html/bug-gnulib/2011-06/threads.html#00105
> ----
>
> Recent?
?
> The most recent posts on that thread look to be from June of last year.
> I.e. a year ago.
>
> I'm trying to stay focused on specific problems -- UTF-8 ordering is defined.
> The gnu library doesn't follow it.
>
> Major problem with so many progs relying on the lib!...
cheers,
Pádraig.