bug-coreutils

bug#15450: SORT failing on some lines


From: Eric Blake
Subject: bug#15450: SORT failing on some lines
Date: Mon, 23 Sep 2013 16:23:08 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130805 Thunderbird/17.0.8

tag 15450 needinfo
thanks

On 09/22/2013 08:28 PM, address@hidden wrote:

> While most items are alphabetically sorted, the following occurs (for
> example):
> 
> "Universe (1960 film)"
> "Universe"
> 
> "Yellow 2G"
> "Yellow"

It sounds like you might be falling foul of a FAQ:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

But to know that for sure, you need to provide more information: what
locale settings are you using, and does your locale treat punctuation as
insignificant when doing strcoll()?  Are you using LC_ALL=C to force C
locale sorting?
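
To illustrate the question above, here is a minimal sketch of forcing C locale
sorting. The file names titles.txt and titles.sorted are made up for this
example, and the four lines are just the titles quoted in the report, not the
reporter's real data. Running `locale` beforehand shows which settings sort
would otherwise inherit.

```shell
# Hypothetical sample data: the four titles quoted in the report.
printf '%s\n' 'Universe (1960 film)' 'Universe' 'Yellow 2G' 'Yellow' > titles.txt

# LC_ALL=C makes sort compare raw bytes, so a line that is a prefix of
# another line (e.g. "Universe") sorts before the longer one.
LC_ALL=C sort titles.txt > titles.sorted
cat titles.sorted
```

With byte-wise comparison, "Universe" comes out before "Universe (1960 film)"
and "Yellow" before "Yellow 2G", which is the order a plain string compare in
a C++ program would expect.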

> 
> the lines are in the wrong order. My C++ program which searches the
> index expects that "Universe" comes before "Universe (1960 film)" when
> doing a string compare.
> 
> Interestingly, if I copy these problem lines into a separate text file
> and run SORT on them, it sorts correctly.
> I have tried every switch combination I can think of but the problem
> remains.

You didn't show the actual options you are trying, so it's hard to say
without more information.  Are those the full lines that you are sorting,
or are you sorting something more like this?

$ LC_ALL=en_US.UTF-8 sort -t/ foo
Yellow 2G/1
Yellow/2
$ LC_ALL=en_US.UTF-8 sort -t/ -k1,1 foo
Yellow/2
Yellow 2G/1

Note how in the en_US locale, which ignores punctuation, I was able to
get a different sort order depending on whether I remembered to
terminate the sort key at the separator, vs. letting it strcoll() on the
full line.

Have you played with the --debug option, to make sure you are sorting on
what you THINK you should be sorting on?
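
As a rough sketch of what that looks like, assuming a GNU sort new enough to
have --debug, and reusing the two-line sample file foo from the example above
(debug.out is a made-up name for the captured output):

```shell
# Same hypothetical two-line sample as in the example above.
printf '%s\n' 'Yellow 2G/1' 'Yellow/2' > foo

# --debug underlines the part of each line that is actually used as the
# sort key and prints diagnostics about questionable usage.
# 2>&1 | tee keeps a copy of everything in debug.out for later inspection.
LC_ALL=C sort --debug -t/ -k1,1 foo 2>&1 | tee debug.out
```

Each output line is followed by a row of underscores marking the bytes that
were compared, which makes it easy to see whether the key stops at the
separator or runs to the end of the line.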

> I am wondering if it is something to do with the size of the file I am
> trying to sort. 605 megabytes, about 10,000,000 lines of text. Again,
> most of the lines are sorted correctly, but some (and I haven't checked
> exactly how many, but am finding them at random) are not.

The size of the file most likely has nothing to do with it.
To guarantee it is not a bad merge when sort uses multiple files, rerun
your command with 'sort --parallel=1 $your_options...' to ensure that
there are no temporary files to be merged (if there IS a bug with how
temporaries are merged, we definitely want to fix that; it would show up
with --parallel larger than 1).
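
One way to run that comparison, sketched here with a tiny stand-in file
(index.txt, single.out and default.out are hypothetical names; substitute your
real input and your real options):

```shell
# Hypothetical stand-in for the large index file.
printf '%s\n' 'Yellow 2G' 'Universe' 'Yellow' 'Universe (1960 film)' > index.txt

# Sort once with a single thread and once with the default thread count.
LC_ALL=C sort --parallel=1 index.txt > single.out
LC_ALL=C sort index.txt > default.out

# If the two results differ, that points at a merge problem worth reporting.
cmp single.out default.out && echo "outputs match"

# sort -c verifies that a file is already in order under the same options.
LC_ALL=C sort -c single.out && echo "single.out is sorted"
```

Use the same locale and key options for the -c check as for the original
sort, otherwise the check itself can report a false disorder.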

Again, I suspect it is in your locale or command line, but without
enough details I can't prove that.  So I'll leave this bug open while
waiting for more details.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
