bug#21916: sort -u drops unique lines with some locales

bug-coreutils

From:	Pádraig Brady
Subject:	bug#21916: sort -u drops unique lines with some locales
Date:	Sat, 14 Nov 2015 11:06:22 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

tag 21916 notabug
close 21916
stop

On 14/11/15 05:38, Christoph Anton Mitterer wrote:
> Hey.
> 
> (GNU coreutils 8.23)
> 
> Attached is a file that, when sort -u'ed in my locale, loses lines
> which are nevertheless unique.
> 
> I've also attached the locale, since it's a custom-made one, but the
> same seems to happen with "standard" locales as well; see e.g.
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489
> 
> Cheers,
> Chris.
> 
> PS: Please keep me CCed, as I'm writing off list.

Unfortunately, the Roman numeral code points compare equal under strcoll():

  $ printf '%s\n' Ⅱ Ⅰ | ltrace -e strcoll sort
  sort->strcoll("\342\205\241", "\342\205\240") = 0
  Ⅱ
  Ⅰ

If you compare at the byte level you'll get appropriate grouping:

  $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
  Ⅰ
  Ⅱ
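
Applied to the -u case from the report, the same bytewise comparison keeps every line whose bytes differ and drops only true duplicates. A minimal sketch, where the helper name sort_unique_bytes and the three-line input are made up for illustration:

```shell
# Hypothetical helper: de-duplicate stdin by raw byte comparison,
# bypassing the locale's collation rules (strcoll) entirely.
sort_unique_bytes() {
  LC_ALL=C sort -u
}

# Two distinct Roman numeral code points plus one exact duplicate.
printf '%s\n' Ⅱ Ⅰ Ⅱ | sort_unique_bytes
```

This should print Ⅰ and Ⅱ once each. The trade-off is that the resulting order is by byte value rather than by the locale's collation order.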

The same goes for other similar representations,
such as the fullwidth forms of the ASCII digits:

  $ printf '%s\n' ２ １ | ltrace -e strcoll sort
  sort->strcoll("\357\274\222", "\357\274\221") = 0
  ２
  １

That's a bit surprising, though perhaps, since only a limited
number of these representations are provided, it was
not thought worthwhile to provide collation orders for them.

There are details on the Unicode representation at:
https://en.wikipedia.org/wiki/Numerals_in_Unicode#Roman_numerals_in_Unicode
which says "[f]or most purposes, it is preferable to compose the Roman numerals
from sequences of the appropriate Latin letters".

For example, you could mix Latin and Cyrillic letters (as found in ISO 8859-1
and ISO 8859-5 respectively) to get appropriate sorting:

$ printf '%s\n' I II III IV V VI VII VIII ІХ Х ХI ХII ХIII ХIV ХV ХVI ХVII ХVIII ХІХ | sort
I
II
III
IV
V
VI
VII
VIII
ІХ
Х
ХI
ХII
ХIII
ХIV
ХV
ХVI
ХVII
ХVIII
ХІХ

If only portions of the line were appropriate to treat in the C locale
(not really for your grouping case, but generally for sorting, for example),
then you'd need to consider transformations like enclosed, fullwidth, halfwidth -> ASCII,
which might be done with a separate utility. Number-specific transformations
like the above could perhaps be handled within the numfmt utility?
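
As a rough illustration of such a transformation done outside sort, the sketch below maps the fullwidth digits to ASCII with sed before a numeric sort. The helper name fullwidth_to_ascii is hypothetical, and this is not an existing coreutils or numfmt feature:

```shell
# Hypothetical helper: map fullwidth digits (U+FF10..U+FF19) to ASCII digits.
# One s command per digit, so the substitution also works bytewise
# when run under LC_ALL=C.
fullwidth_to_ascii() {
  sed -e 's/０/0/g' -e 's/１/1/g' -e 's/２/2/g' -e 's/３/3/g' -e 's/４/4/g' \
      -e 's/５/5/g' -e 's/６/6/g' -e 's/７/7/g' -e 's/８/8/g' -e 's/９/9/g'
}

# Normalize first, then sort numerically.
printf '%s\n' ２ １ １０ | fullwidth_to_ascii | sort -n
```

Note that this rewrites the lines themselves, so the output contains the ASCII forms (1, 2, 10 here) rather than the original characters. Keeping the original text would need the normalized form added as a separate sort key and stripped afterwards.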

One thing we might do immediately is to have the sort --debug option
give some indication when the strcoll() and memcmp() results differ in direction.
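
In the meantime, a rough check is possible from the shell. The sketch below only detects the equality case from this report (lines that differ in bytes but collate as equal), not every difference in ordering. The helper name collation_loss is made up for illustration:

```shell
# Hypothetical helper: print how many byte-distinct lines "sort -u" drops
# in the current locale, by comparing against a bytewise (C locale) run.
collation_loss() {
  file=$1
  locale_count=$(sort -u -- "$file" | wc -l)
  bytes_count=$(LC_ALL=C sort -u -- "$file" | wc -l)
  echo $((bytes_count - locale_count))
}
```

Running collation_loss on the input file prints 0 when the locale's collation distinguishes every byte-distinct line. A nonzero result means some unique lines compare equal under strcoll() and would be dropped by sort -u.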

cheers,
Pádraig.
