bug#13638: linux-sort inconsistency


From: Eric Blake
Subject: bug#13638: linux-sort inconsistency
Date: Wed, 06 Feb 2013 11:21:07 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

tag 13638 notabug
thanks

On 02/06/2013 03:49 AM, Knud Arnbjerg Christensen wrote:
> Hi
> A linux-sort inconsistency occurs when sorting an alphanumeric field:
> the order becomes different depending on whether the following field is
> numeric (file 1) or alphanumeric (file 2). In case 1 the length of the
> shorter fields is extended by 'zeros'; in case 2 the fields are extended
> by blanks, which causes the different sorting order.

This is most likely a product of your locale; you may find this FAQ
addresses your issue:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

> sort -k 1 file1>file1-sorted

Oops - this says to use the first field _and on to the rest of the line_
as the single sort key.  You probably want to limit the sort to just the
first field, using -k1,1 instead.
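
A minimal sketch of the difference between the two key forms, assuming
GNU sort; the two-line sample input and the variable names are made up
for illustration, and -s (stable) is only there to switch off the
last-resort whole-line comparison so the effect of the key is visible:

```shell
# Two lines that share the same first field but differ afterwards
# (hypothetical sample data, not taken from the reporter's files).
input='a 2
a 1'

# -k1: the key runs from field 1 to the end of the line,
# so the text after the first field takes part in the comparison.
whole_line=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k1)

# -k1,1: the key is field 1 only; both keys are "a", so they tie,
# and -s keeps the tied lines in their input order.
first_field=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k1,1)

printf 'with -k1:\n%s\n' "$whole_line"
printf 'with -k1,1:\n%s\n' "$first_field"
```

With -k1 the lines get reordered by what follows the first field; with
-k1,1 they do not.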

Extracting portions of just 3 lines that went differently between your
two invocations:

> Seq_10187 00001   x 00181 00553
> Seq_10190 00001   x 00553 01182
> Seq_101903 00001   x 00586 00331

vs.

> Seq_10187 incomplete B4DN50 Gap junction protein   640
> Seq_101903 incomplete FAIM1 Fas apoptotic inhibitory molecule 1   416
> Seq_10190 incomplete HSF2 Heat shock factor protein 2   1273

Using sort's --debug option will make it quite obvious what is going on:

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | sort -k 1 --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
Seq_10187 incomplete
____________________
____________________
Seq_101903 incomplete
_____________________
_____________________
Seq_10190 incomplete
____________________
____________________


You specified the entire line as the first sort key, and in the
en_US.UTF-8 locale, punctuation (including space) is ignored during
collation.  With the spaces removed, "903i" sorts before "90in" (a digit
sorts before a letter), so Seq_101903 lands ahead of Seq_10190 in your
second file; in your first file the same comparison is "9000" against
"9030", so Seq_10190 comes first.  That explains why the sort order
differs based on whether the text after the space is numeric or
alphabetic.  Now note what happens when you force the C locale, where
every byte is significant during collation, and where "90 in" sorts
before "903 i":

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | LC_ALL=C sort -k 1 --debug
sort: using simple byte comparison
Seq_10187 incomplete
____________________
____________________
Seq_10190 incomplete
____________________
____________________
Seq_101903 incomplete
_____________________
_____________________
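
To see the effect of ignoring spaces without depending on which locales
are installed, here is a rough sketch that deletes the spaces by hand
and then byte-sorts in the C locale.  This only approximates what
en_US.UTF-8 collation does (the real rules are more involved), and the
helper name strip_and_sort is made up for illustration:

```shell
# Hypothetical helper: drop every space, then compare raw bytes.
# This imitates "spaces are ignored" in a crude way; it is not how
# the locale's collation is actually implemented.
strip_and_sort() {
    tr -d ' ' | LC_ALL=C sort
}

# Alphabetic text after the first field: "903i" vs. "90in".
alpha=$(printf 'Seq_10190 incomplete\nSeq_101903 incomplete\n' | strip_and_sort)

# Numeric text after the first field: "9000" vs. "9030".
digits=$(printf 'Seq_10190 00001\nSeq_101903 00001\n' | strip_and_sort)

printf 'alphabetic second field:\n%s\n' "$alpha"
printf 'numeric second field:\n%s\n' "$digits"
```

In the alphabetic case the Seq_101903 line comes out first; in the
numeric case the Seq_10190 line does, which matches the two orders you
observed.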

Meanwhile, what you probably wanted is to sort by JUST the first field
(note how I added -b as suggested, and used -k1,1 instead of -k1).

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | sort -b -k 1,1 --debug
sort: using ‘en_US.UTF-8’ sorting rules
Seq_10187 incomplete
_________
____________________
Seq_10190 incomplete
_________
____________________
Seq_101903 incomplete
__________
_____________________
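
As a quick check, the sketch below runs the three lines quoted earlier
from each of your files through a first-field-only sort and compares
the resulting order of the first column.  It uses LC_ALL=C so that it
does not depend on the installed locales; the variable names are
illustrative only:

```shell
# The three lines quoted above from each file.
numeric='Seq_10187 00001   x 00181 00553
Seq_10190 00001   x 00553 01182
Seq_101903 00001   x 00586 00331'

textual='Seq_10187 incomplete B4DN50 Gap junction protein   640
Seq_101903 incomplete FAIM1 Fas apoptotic inhibitory molecule 1   416
Seq_10190 incomplete HSF2 Heat shock factor protein 2   1273'

# Sort on field 1 only, then keep just that field for comparison.
order_numeric=$(printf '%s\n' "$numeric" | LC_ALL=C sort -k1,1 | cut -d' ' -f1)
order_textual=$(printf '%s\n' "$textual" | LC_ALL=C sort -k1,1 | cut -d' ' -f1)

if [ "$order_numeric" = "$order_textual" ]; then
    echo "same order in both files"
else
    echo "order differs"
fi
```

Both inputs come out as Seq_10187, Seq_10190, Seq_101903, regardless of
what follows the first field.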


As such, I'm closing this bug report, although you may feel free to add
further comments or questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
