From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#24601: closed (UTF-8 locale makes lexicographic sort weird)
Date: Mon, 03 Oct 2016 21:58:03 +0000

Your message dated Mon, 3 Oct 2016 16:57:53 -0500
with message-id <address@hidden>
and subject line Re: bug#24601: UTF-8 locale makes lexicographic sort weird
has caused the debbugs.gnu.org bug report #24601,
regarding UTF-8 locale makes lexicographic sort weird
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)

-- 
24601: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24601
GNU Bug Tracking System
Contact address@hidden with problems

--- Begin Message ---
Subject: UTF-8 locale makes lexicographic sort weird
Date: Mon, 03 Oct 2016 19:54:02 +0000

coreutils-8.25 compiled from source on Fedora 24:

% echo "+00\n-0c\n+02\n-02" | src/sort
+00
-02
+02
-0c

This seems to be due to locale:

% echo "+00\n-0c\n+02\n-02" | LC_ALL=C src/sort
+00
+02
-02
-0c

% echo "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 src/sort
+00
-02
+02
-0c

Since OS X 10.11 still comes with coreutils 5.93, I tried that:

% echo "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
+00
+02
-02
-0c

I've taken a look at the Unicode collation standard, and I can't immediately see anything that explains the current (8.25) behavior.

I've also played around with <http://demo.icu-project.org/icu-bin/locexp?_=en_US.UTF-8&d_=en&x=col> and I can't come up with any set of Unicode collation options that gives the same results.

mathew

--- End Message ---

--- Begin Message ---
Subject: Re: bug#24601: UTF-8 locale makes lexicographic sort weird
Date: Mon, 3 Oct 2016 16:57:53 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.3.0

tag 24601 notabug
thanks

On 10/03/2016 02:54 PM, mathew wrote:
> coreutils-8.25 compiled from source on Fedora 24:
> 
> % echo "+00\n-0c\n+02\n-02" | src/sort

Not all 'echo' programs understand \n as an escape sequence; you are
better off using the portable printf(1) when trying to demonstrate
simple programs, as in:

$ printf '+00\n-0c\n+02\n-02' | sort --debug
sort: using ‘en_US.UTF-8’ sorting rules
+00
___
-02
___
+02
___
-0c
___

> 
> This seems to be due to locale:

Indeed, it is entirely due to locale, and hence is not a bug but rather
POSIX-mandated behavior that sort honors your locale rules.

> Since OS X 10.11 still comes with coreutils 5.93, I tried that:
> 
> % echo "+00\n-0c\n+02\n-02" | LC_ALL=en_US.UTF-8 sort
> +00
> +02
> -02
> -0c

The sad thing is that POSIX says that locale authors (for all but the C
locale) have absolute control over all sorts of fiddly aspects of how
strcoll() behaves, and that just because two vendors declare that their
locale is named en_US.UTF-8 does NOT require those two vendors to have
the same locale definition.  So the collation rules between two
different platforms are very likely different, based on whoever wrote
the locale file for that platform, and what bug fixes have been
incorporated into the locale definition over time.
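
One way to see what a given platform's locale files actually do is to
sort the same lines under several locale names and compare the results.
The following is only a rough sketch in Python, not anything sort(1)
itself runs; the helper name sorted_under() is made up for this
illustration, and it assumes en_US.UTF-8 may or may not be installed on
the machine where it is tried (it reports None in that case).

```python
import locale


def sorted_under(lines, locale_name):
    """Return lines sorted by locale_name's collation, or None if it is not installed."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # this platform has no such locale
    try:
        # strxfrm() maps each string to a key whose ordering matches strcoll().
        return sorted(lines, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


lines = ["+00", "-0c", "+02", "-02"]
orders = {name: sorted_under(lines, name) for name in ("C", "en_US.UTF-8")}
for name, order in orders.items():
    print(name, order)
```

Only the "C" row is predictable across platforms; whatever the
en_US.UTF-8 row shows depends on the locale definition that happens to
be installed.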

It appears that you are complaining that between your two systems, one
sorts the line '-02' before '+02' (even though it was specified later);
while the other system leaves the two lines unchanged with '+02' first.
If strcoll("-02", "+02") says the two strings collate identically, then
the two lines should have a final tie-breaker based on byte values
(which would put +02 first by byte values); but if the locale has
secondary (or even tertiary) sorting rules that put '-' before '+' (even
after the primary pass ignores punctuation and focuses only on
alphanumerics), then there is no chance for the tiebreaker rule to kick in.
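
The two-stage comparison described above can be sketched roughly as
follows. This is a simplified Python model, not the actual coreutils
code; the names sort_like_compare() and sort_lines() are hypothetical,
and the byte comparison on UTF-8 encoded text is an assumption made for
the sake of the illustration. The locale is pinned to "C" here so that
the result does not depend on the host's locale files.

```python
import functools
import locale


def sort_like_compare(a: str, b: str) -> int:
    """Compare two lines: locale collation first, raw bytes as the tie-breaker."""
    primary = locale.strcoll(a, b)
    if primary != 0:
        # The locale already decided the order; the tie-breaker never runs.
        return primary
    # strcoll() reported a tie, so fall back to comparing the encoded bytes.
    raw_a, raw_b = a.encode("utf-8"), b.encode("utf-8")
    return (raw_a > raw_b) - (raw_a < raw_b)


def sort_lines(lines):
    return sorted(lines, key=functools.cmp_to_key(sort_like_compare))


# Pin the C locale so the outcome is the plain byte order on any platform.
locale.setlocale(locale.LC_COLLATE, "C")
lines = ["+00", "-0c", "+02", "-02"]
result = sort_lines(lines)
print(result)
```

Under any other locale, whether the fallback branch is ever reached for
'-02' versus '+02' depends on whether that locale's strcoll() returns 0
for the pair.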

Sadly, even the 'sort --debug' option is not able to easily demonstrate
the subtleties that go into the strcoll() function's rules for obeying
the locale sorting specification.

> 
> I've taken a look at the Unicode collation standard, and I can't
> immediately see anything that explains the current (8.25) behavior.

Locale rules are not required to follow Unicode collation rules, at
least not by POSIX.  It would be nice if all locales were synonymous
across platforms and behaved equivalently to Unicode rules, and in fact
glibc locale authors try hard to obey Unicode when writing locales, but
reading the Unicode collation standard will not tell you how a
particular locale will behave; only reading that locale's definition
will tell you what it will do.

At any rate, I don't see any bug in coreutils proper.  Perhaps you have
uncovered a problem in glibc's locale definition, whose behavior has
changed over time relative to what you think it should do (you didn't
even state what you were EXPECTING to see, only that the output differed
from your expectations); but if so, that is better reported to the glibc
list.  In the meantime, I'm closing this as not a coreutils bug,
although you can feel free to continue the conversation.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


--- End Message ---
