emacs-bug-tracker
[debbugs-tracker] bug#19021: closed (Possible bug in sort)


From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#19021: closed (Possible bug in sort)
Date: Tue, 11 Nov 2014 17:40:05 +0000

Your message dated Tue, 11 Nov 2014 10:39:13 -0700
with message-id <address@hidden>
and subject line Re: bug#19021: Possible bug in sort
has caused the debbugs.gnu.org bug report #19021,
regarding Possible bug in sort
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
19021: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19021
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: Possible bug in sort
Date: Tue, 11 Nov 2014 11:39:12 -0500

http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc

Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3

This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv

This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t , -k 1n

Using 'g' instead of 'n' also produces the expected results, but I'm not clear on what the difference is between 'g' and 'n'.

Tested with sort 8.21 on Slackware64-current.

--- End Message ---
--- Begin Message ---
Subject: Re: bug#19021: Possible bug in sort
Date: Tue, 11 Nov 2014 10:39:13 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0

tag 19021 notabug
thanks

On 11/11/2014 09:39 AM, Ben Mendis wrote:
> http://stackoverflow.com/questions/26869717/why-does-sort-seem-to-sort-a-field-incorrectly-based-on-the-presence-or-absenc
> 
> Data is here: https://gist.github.com/anonymous/2a7beb4871b25ae8f8b3

Thanks for the report.  Rather than making us chase down links, why not
provide the information inline with your email?

> 
> This results in line 7 being sorted incorrectly: sort -t , -k 1n < weird.csv

Try using the --debug option to see what is really happening.  The bug
is NOT in sort, which correctly obeyed both your locale rules and the
command line you gave it, but in that command line: you didn't tell sort
where to stop parsing the number.

I'm going to distill it down to a smaller input that still expresses the
same "swapped" lines:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,73,67,6
_________
_________
2,68,61,7
_________
_________
1,69,55,14
__________
__________
2,71,59,12
__________
__________

See what's happening? The -k1n argument says to start parsing at field
1, but continue parsing until either the input is no longer numeric or
until the end of line is reached (even if it goes into field 2 or
beyond). Since commas are silently ignored in the en_US.UTF-8 locale
when parsing a number, sort is thus comparing the values 268617 and
1695514, and the sort was correct.
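
To make that comparison concrete, here is a rough emulation (a sketch for illustration, not what sort does internally): strip the commas from each line to get the number that -k1n effectively sees in en_US.UTF-8, then sort on that number in the C locale. The awk helper, the variable names, and the tab-separated "key, original line" layout are assumptions made up for this example.

```shell
# Emulate the keys that -k1n compares when ',' is a thousands separator:
# drop every comma, then sort numerically on the resulting number.
input='1,73,67,6
2,68,61,7
1,69,55,14
2,71,59,12'
tab=$(printf '\t')
keyed_sorted=$(printf '%s\n' "$input" \
  | awk '{ key = $0; gsub(/,/, "", key); print key "\t" $0 }' \
  | LC_ALL=C sort -t "$tab" -k1,1n)
printf '%s\n' "$keyed_sorted"
```

The keys come out as 173676, 268617, 1695514 and 2715912, so the resulting line order matches the --debug output above.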

Now, try telling sort that it must parse a numeric field, but must END
the parse at the end of the first field (if not sooner due to end of
number):

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1,1n --debug
sort: using ‘en_US.UTF-8’ sorting rules
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________
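
The same idea extends to more than one key. As a sketch that goes slightly beyond the original question (the second key is my own addition), every key can be given an explicit end field so that none of them runs past its comma:

```shell
# Bound each key to a single field: sort on field 1 numerically,
# then break ties on field 2 numerically.
sorted=$(printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
  | sort -t, -k1,1n -k2,2n)
printf '%s\n' "$sorted"
```

Because each key now contains only digits, the result should not depend on whether the locale treats ',' as a thousands separator.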

Or try using a locale where ',' is NOT part of a valid number:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | LC_ALL=C sort -t, -k1n --debug
sort: using simple byte comparison
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________
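
If you want to check what a given locale uses as its thousands separator, the POSIX locale utility can print the LC_NUMERIC keyword thousands_sep. This is only a sketch; it assumes the locale command is available and that en_US.UTF-8 is installed on your system (if it is not, the second value may simply mirror the C locale).

```shell
# Print the LC_NUMERIC thousands separator for the C locale and,
# if it is installed, for en_US.UTF-8.
c_sep=$(LC_ALL=C locale thousands_sep)
us_sep=$(LC_ALL=en_US.UTF-8 locale thousands_sep 2>/dev/null || true)
printf 'C: "%s"\nen_US.UTF-8: "%s"\n' "$c_sep" "$us_sep"
```

An empty separator in the C locale is why the comma ends the number there.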


> 
> This produced the expected results: cut -f , -d 1-3 < weird.csv | sort -t ,
> -k 1n

Actually, you mean 'cut -d, -f 1-3' (you transposed the options while
transferring from the stackoverflow site to your email).  But yeah, when
you truncate each line to three fields, sort is comparing different,
shorter values (17367 is less than 26861), so the order comes out as you
expected.

> 
> Using 'g' instead of 'n' also produces the expected results, but I'm not
> clear on what the difference is between 'g' and 'n'.

-n is specified by POSIX as parsing decimal numbers (digits with an
optional radix character and thousands separators) according to the
current locale's definition.  -g is a GNU extension, which says to parse
floating point numbers.  Apparently, in the en_US.UTF-8 locale, parsing
for -g stops at the first comma, while parsing for -n does not:

$ printf '1,73,67,6\n2,68,61,7\n1,69,55,14\n2,71,59,12\n' \
 | sort -t, -k1g --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
1,69,55,14
_
__________
1,73,67,6
_
_________
2,68,61,7
_
_________
2,71,59,12
_
__________
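
Another place where the two options differ, independent of commas, is exponent notation. A small sketch (the input values are made up for illustration, and LC_ALL=C is used only to keep the result predictable):

```shell
# Under -n, "1e3" is read as 1 because parsing stops at the 'e';
# under -g it is read as the floating point value 1000.
n_order=$(printf '1e3\n200\n' | LC_ALL=C sort -n)
g_order=$(printf '1e3\n200\n' | LC_ALL=C sort -g)
printf 'sort -n:\n%s\nsort -g:\n%s\n' "$n_order" "$g_order"
```

With -n the line 1e3 sorts before 200, while with -g it sorts after.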

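I believe the difference is that -n skips the locale's thousands
separator while scanning digits, whereas -g hands the text to strtold(),
which does not accept grouping characters and so stops at the first comma.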

But at any rate, I hope I've demonstrated that the bug was in your usage
and not in sort.  So I'm closing this bug, although you should feel free
to add further comments or questions.  You may also want to read the FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021
[Hmm - we should update that FAQ to mention the --debug option]

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



--- End Message ---
