bug-coreutils
bug#8067: sort fails to sort completely, due to "similar" keys.


From: Bob Harris
Subject: bug#8067: sort fails to sort completely, due to "similar" keys.
Date: Thu, 17 Feb 2011 16:47:41 -0500


On Feb 17, 2011, at 4:30 PM, Eric Blake wrote:

On 02/17/2011 01:46 PM, Bob Harris wrote:
Howdy,

(note: I know I should give you version information with this, but (1) I
am not sure that this message will be read by anyone, and (2) I think
the problem probably transcends versions. If I get a response and the
actual version is important, I will take the time to find it.)

Thanks for the report, and you are correct that your issue transcends
versions.  However, if you use coreutils 8.6 or newer (the latest is
8.10), then the new --debug option would have helped you.
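As a rough illustration of that option, the sketch below feeds the two keys from the report to `sort --debug`. It assumes GNU sort from coreutils 8.6 or newer is the `sort` on the PATH; the variable names `keys` and `debug_out` are only for this example, and the exact wording of the diagnostics differs between coreutils versions.

```shell
# The two keys from the report, one per line.
keys='HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1'

# --debug annotates the part of each line used as the sort key and
# reports on stderr which comparison rules are in effect.
# 2>&1 captures those diagnostics together with the sorted lines.
debug_out=$(printf '%s\n' "$keys" | LC_ALL=C sort --debug 2>&1)
printf '%s\n' "$debug_out"
```

Running the same pipeline without `LC_ALL=C` should instead name the locale whose sorting rules are being applied, which is the hint that would have pointed at the cause here.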


I have a file of genomic short sequence info in which it so happens that
two of my sort key values are similar.  The two keys are
   HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
   HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1
As you can see, these are identical if one removes the colons.

Which sounds like exactly what sort does when you are sorting in the
en_US.UTF-8 locale.

I have tried several different options but none seem to work. -d seems
to be the default, and it has the behavior indicated above.  -n fails
completely. -g also fails. Reading the man page, I don't see any other
options to control the comparison function.

Then you missed this part (in the sort man page, which is in turn
generated from 'sort --help'):

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

I understand *why* -d considers these two keys equal.  What I don't
understand is why there is no option that says "order them
lexicographically".

That option is your set of locale-specific environment variables.  Why
it's not an explicit option is due to historical accident (that's the
way POSIX specified it).  Maybe GNU sort should add a
--collate-locale=... option as an extension that overrides LC_ALL, but
that seems a bit like bloat, and doesn't buy much over using the
standardized means of choosing collation sequencing.


Is there a hidden sort option that will do what I need?

Yep - try 'LC_ALL=C sort ...' to see the difference.
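A minimal sketch of that suggestion, using the two keys from the report (the variable names `keys` and `sorted` are just for illustration). In the C locale the comparison is byte by byte, so the colons are significant and the two keys no longer compare as equal:

```shell
# The two keys from the report, one per line.
keys='HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1'

# The LC_ALL=C prefix applies to this one sort invocation only.
sorted=$(printf '%s\n' "$keys" | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

The `...:5:21:...` key comes out first, because the byte `1` (0x31) sorts before `:` (0x3A) at the first position where the two keys differ.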

I'm pretty sure I'm not the first person to run into this problem.

You're not.  It's a FAQ:

http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

--
Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


Thanks Eric, for the informative reply, and the FAQ link.

That makes sense (in the sense that I can see how to correct it). Currently my LC_ALL is set to nothing. Some additional googling after sending my previous message revealed that this is also affected by LANG (my default is en_US.UTF-8, and LANG=C gives the desired result). I was in the process of investigating what else I might break by fiddling with LANG when your message arrived.

So I'll investigate LC_ALL instead and see whether there are any negative side effects, so that I'll (hopefully) know what trade-off, if any, I am making.
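One way to keep side effects small, sketched here for a POSIX shell, is to set the variable as a prefix on the single sort command rather than exporting it for the whole session. The sketch also relies on the POSIX precedence rule that a non-empty LC_ALL overrides LC_COLLATE, which in turn overrides LANG, so overriding LC_COLLATE alone leaves the other locale categories untouched. The variable names `keys`, `by_all`, and `by_collate` are only for this example.

```shell
# The two keys from the report, one per line.
keys='HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1'

# Show the settings this shell currently has (empty means unset or null).
printf 'LANG=%s LC_COLLATE=%s LC_ALL=%s\n' "${LANG-}" "${LC_COLLATE-}" "${LC_ALL-}"

# Override every locale category, but only for this one command.
by_all=$(printf '%s\n' "$keys" | LC_ALL=C sort)

# Override collation only. An empty LC_ALL is treated as unset,
# so LC_COLLATE is the setting that decides the sort order here.
by_collate=$(printf '%s\n' "$keys" | LC_ALL= LC_COLLATE=C sort)

printf '%s\n' "$by_collate"
```

Both commands should produce the same byte-order result, and neither changes the environment of the shell or of any later command.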

Thanks again,
Bob H

P.S. You're right, I missed the warning in the man page. I was diligently looking through the options for one that would do what I needed, and didn't realize there were other descriptive notes below the options.






