
bug#19021: Possible bug in sort


From: Leslie S Satenstein
Subject: bug#19021: Possible bug in sort
Date: Tue, 11 Nov 2014 18:27:49 +0000 (UTC)

Why not use sort -t ',' -k 1n ?
 Regards 
 Leslie
 Mr. Leslie Satenstein
Montréal Québec, Canada


 
From: Eric Blake <address@hidden>
To: Ben Mendis <address@hidden>; address@hidden
Sent: Tuesday, November 11, 2014 12:39 PM
Subject: bug#19021: Possible bug in sort
   
tag 19021 notabug
thanks

On 11/11/2014 09:39 AM, Ben Mendis wrote:
> http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc
> 
> Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3

Thanks for the report.  Rather than making us chase down links, why not
provide the information inline with your email?

> 
> This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv

Try using the --debug option to see what is really happening.  The bug
is NOT in sort (which correctly obeyed your locale rules and incorrect
command line), but in your command line (because you didn't tell sort
where to quit parsing numbers).

I'm going to distill it down to a smaller input that still expresses the
same "swapped" lines:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,73,67,6
_________
_________
2,68,61,7
_________
_________
1,69,55,14
__________
__________
2,71,59,12
__________
__________

See what's happening? The -k1n argument says to start parsing at field
1, but continue parsing until either the input is no longer numeric or
until the end of line is reached (even if it goes into field 2 or
beyond). Since commas are silently ignored in the en_US.UTF-8 locale
when parsing a number, sort is thus comparing the values 268617 and
1695514, and the sort was correct.
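
To make that arithmetic concrete, here is a minimal sketch that strips the
commas by hand and sorts the resulting integers in the C locale. The
variable names (input, merged) are only for illustration, and tr is a
stand-in for what the locale-aware numeric parse effectively does; it is
not how sort is implemented.

```shell
# Same four lines as in the distilled example.
input='1,73,67,6
2,68,61,7
1,69,55,14
2,71,59,12'

# Remove the commas so each line becomes the single integer that
# -k1n effectively sees when ',' is a thousands separator, then sort
# those integers numerically in the C locale.
merged=$(printf '%s\n' "$input" | tr -d ',' | LC_ALL=C sort -n)
printf '%s\n' "$merged"
```

The resulting order (173676, 268617, 1695514, 2715912) matches the line
order in the --debug output above.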

Now, try telling sort that it must parse a numeric field, but must END
the parse at the end of the first field (if not sooner due to end of
number):

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1,1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________
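
If the order within each group also matters, one possible approach is to
bound every key explicitly. The second key below is my own addition, not
something from the report, and LC_ALL=C is used only to keep the result
independent of the locale installed on your machine:

```shell
# Same four lines as in the distilled example.
input='1,73,67,6
2,68,61,7
1,69,55,14
2,71,59,12'

# -k1,1n: numeric key that starts AND ends at field 1.
# -k2,2n: hypothetical tie-breaker, numeric on field 2 only.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -t, -k1,1n -k2,2n)
printf '%s\n' "$sorted"
```

Because each key names both a start and an end field, sort never parses
past the field it was given.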

Or try using a locale where ',' is NOT part of a valid number:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | LC_ALL=C sort -t, -k1n --debug
sort: using simple byte comparison
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________


> 
> This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t ,
> -k 1n

Actually, you mean 'cut -d, -f 1-3' (you transposed while transferring
from the stackoverflow site to your email).  But yeah, when you truncate
each line to its first three fields, the merged numbers are shorter and
you are comparing different values (17367 is less than 26861).
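
A rough sketch of why the truncated input only appears to work: after cut,
every line merges into a five-digit number, so the leading digit dominates
the comparison. As before, tr is just a stand-in for the locale-aware
parse, and the variable names are illustrative.

```shell
# Same four lines as in the distilled example.
input='1,73,67,6
2,68,61,7
1,69,55,14
2,71,59,12'

# Keep fields 1-3, then strip the commas to expose the integers that
# a comma-ignoring numeric parse would compare.
truncated=$(printf '%s\n' "$input" | cut -d, -f1-3 | tr -d ',' | LC_ALL=C sort -n)
printf '%s\n' "$truncated"
```

Here all four values have the same number of digits, so the order happens
to agree with sorting on field 1; with the fourth field present, the
values have different lengths and it no longer does.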



> 
> Using 'g' instead of 'n' also produces the expected results, but I'm not
> clear on what the difference is between 'g' and 'n'.

-n is specified by POSIX as parsing a numeric string according to the
current locale's definition, which allows thousands separators.  -g is a
GNU extension, which says to parse floating point numbers.  Apparently,
in the en_US.UTF-8 locale, parsing floating point stops at the first
comma, while -n parsing does not:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1g --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________

I don't know why libc chose to make strtoll() ignore commas while
strtold() does not, when not in the C locale.
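
Since you asked about the difference between 'g' and 'n', here is a small
illustration that does not depend on the locale. It assumes GNU sort
(-g is not in POSIX), and the two-line input is made up for the purpose:

```shell
# -n stops at the first character that is not part of a plain decimal
# number, so "1e3" is read as 1 and sorts before 200.
with_n=$(printf '1e3\n200\n' | LC_ALL=C sort -n)

# -g parses floating point notation, so "1e3" is read as 1000 and
# sorts after 200.
with_g=$(printf '1e3\n200\n' | LC_ALL=C sort -g)

printf 'sort -n:\n%s\n' "$with_n"
printf 'sort -g:\n%s\n' "$with_g"
```

In short, -g accepts a wider syntax (exponents, for instance), while -n
sticks to plain decimal numbers.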

But at any rate, I hope I've demonstrated that the bug was in your usage
and not in sort.  So I'm closing this bug, although you should feel free
to add further comments or questions.  You may also want to read the FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
[Hmm - we should update that FAQ to mention the --debug option]

-- 
Eric Blake  eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


   
