bug#21916: sort -u drops unique lines with some locales

bug-coreutils

From:	Bob Proulx
Subject:	bug#21916: sort -u drops unique lines with some locales
Date:	Sun, 15 Nov 2015 17:11:37 -0700
User-agent:	Mutt/1.5.24 (2015-08-30)

Pádraig Brady wrote:
> Christoph Anton Mitterer wrote:
> > Attached is a file that, when sort -u'ed in my locale, loses lines
> > that are in fact unique.
> > 
> > I've also attached the locale, since it's a custom-made one, but the
> > same seems to happen with "standard" locales as well, see e.g.
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=695489
> > 
> > PS: Please keep me CCed, as I'm writing off list.
> 
> If you compare at the byte level you'll get appropriate grouping:
> 
>   $ printf '%s\n' Ⅱ Ⅰ | LC_ALL=C sort
>   Ⅰ
>   Ⅱ

It is also possible to set only LC_COLLATE=C instead of setting
everything to C with LC_ALL.  Note that LC_ALL, if it is set in the
environment, overrides LC_COLLATE.
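
A minimal sketch of that, assuming a POSIX shell and a sort that honors
LC_COLLATE.  The function name sort_unique_bytewise is made up for this
illustration.  LC_ALL is cleared inside a subshell so that it cannot
take precedence over LC_COLLATE.

```shell
# Hypothetical helper: sort -u with byte-order collation only.
# Other locale categories (LC_CTYPE, LC_MESSAGES, ...) are left alone.
sort_unique_bytewise() {
  # The subshell keeps the caller's LC_ALL untouched.
  ( unset LC_ALL; LC_COLLATE=C sort -u "$@" )
}

printf '%s\n' Ⅱ Ⅰ ２ １ | sort_unique_bytewise
```

With byte-wise collation the four lines compare as distinct, so -u
keeps all of them.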

> The same goes for other similar representations,
> like fullwidth forms of Latin digits:
> 
>   $ printf '%s\n' ２ １ | ltrace -e strcoll sort
>   sort->strcoll("\357\274\222", "\357\274\221") = 0
>   ２
>   １
>
> That's a bit surprising, though maybe since only a limited
> number of these representations are provided, it was
> not thought appropriate to provide collation orders for them.

Hmm...  Seems questionable to me.

> There are details on the unicode representation at:
> https://en.wikipedia.org/wiki/Numerals_in_Unicode#Roman_numerals_in_Unicode
> Where it says "[f]or most purposes, it is preferable to compose the Roman
> numerals from sequences of the appropriate Latin letters"
> 
> For example you could mix ISO 8859-1 and ISO 8859-5 to get
> appropriate sorting:

One can transliterate them using 'iconv'.

  printf '%s\n' Ⅱ Ⅰ ２ １ | iconv -f UTF-8 -t ASCII//TRANSLIT | sort
  1
  2
  I
  II
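
If the original characters should survive in the output, the
transliteration can serve as a sort key only.  The following is a rough
sketch, not a tested recipe: the function name sort_by_translit is made
up, input lines are assumed to contain no tab characters, and the
//TRANSLIT suffix is assumed to be supported by the local iconv (it
depends on the iconv implementation).

```shell
# Hypothetical helper: decorate each line with its ASCII transliteration,
# sort on that key, then strip the key again.
sort_by_translit() {
  while IFS= read -r line; do
    # If iconv cannot transliterate, the key is empty but the line is kept.
    key=$(printf '%s' "$line" | iconv -f UTF-8 -t ASCII//TRANSLIT 2>/dev/null)
    printf '%s\t%s\n' "$key" "$line"
  done | LC_ALL=C sort -t "$(printf '\t')" -k1,1 | cut -f2-
}

printf '%s\n' Ⅱ Ⅰ ２ １ | sort_by_translit
```

Note that -u is deliberately left out here.  Combined with a key it
would drop lines whose transliterations compare equal, for example a
Roman numeral character and the same numeral spelled with Latin letters.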

Bob
