bug#14988: sort enhancement request

From: Eric Blake
Subject: bug#14988: sort enhancement request
Date: Tue, 30 Jul 2013 16:33:55 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7

tag 14988 needinfo

On 07/30/2013 02:51 PM, Danny Nicholas wrote:
> Hi guys,

[can you convince your mailer to wrap long lines?]

> I am presently using version 7.1 on a Solaris box.  I downloaded 8.21 and 
> really love the improvement in speed (almost 50% in some tests).  I am 
> looking to replace the commercial product NSORT and would like this feature 
> in the source instead of a wrapper.  If I have a file
> XXXX300001XXXX
> XXXX300002XXXX
> XXXX300003XXXX
> XXXX300003XXXX
> XXXX300003XXXX
> XXXX300003XXXX
> XXXX300004XXXX
> XXXX300005XXXX
> XXXX300006XXXX
> XXXX300007XXXX

As written, your example is already in sorted order, and with no other
distinguishing features on each line, it does not show that sort is
failing to output lines in the order you want.  I also can't tell if the
XXXX represent the actual bytes you are sorting, or if you meant them as
placeholders for a sanitized version of your actual data set.  You'll
need to give us an actual example of lines that are sorted differently
by nsort and GNU sort, and the command line options you attempted for
GNU sort, before we can tell you what to try next.

> NSORT keeps the 4 300003 records together in entry sequence.   My present 
> work-around is to use a Python script that reads in the whole file and 
> creates a pseudo-key that is 30000X plus an 8 digit sequence number (I 
> process millions of records).  What I am thinking of is an -es 
> (--entry-sequence) that would add a hidden -k to process on this internal 
> sequence.  If I figure out how to do this on my own, I will submit it to you.
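
[For reference, the pseudo-key workaround described above could be
sketched in shell roughly as follows.  This is only an illustration
under assumptions: the key is taken to be bytes 5-10 of each line, the
sample lines and their trailing AAAA/BBBB tags are invented so that
input order is visible, and the real Python script may well differ.]

```shell
# Decorate each line with "<6-digit key><8-digit sequence number> ",
# sort on that pseudo-key alone, then strip the decoration again.
out=$(printf 'XXXX300003BBBB\nXXXX300001XXXX\nXXXX300003AAAA\n' \
  | awk '{ printf "%s%08d %s\n", substr($0, 5, 6), NR, $0 }' \
  | LC_ALL=C sort -t ' ' -k1,1 \
  | cut -d ' ' -f2-)
printf '%s\n' "$out"
```

Because the sequence number is zero-padded to a fixed width, a plain
byte comparison of the pseudo-key keeps equal keys in entry order.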

Short options must be one letter long; writing your proposed 'sort -es'
would be the same as 'sort -e -s'.  Also, we are reluctant to burn short
options; these days, it's better to add a long option only, until it
proves its popularity, so that we don't collide with any future
standardized short options.
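
The bundling rule can be seen with any POSIX-style option parser.  The
sketch below uses the shell's getopts builtin purely to illustrate the
convention; the function name 'parse' and the '-e' option are
hypothetical, and this is not how sort itself parses its arguments:

```shell
# Hypothetical parser accepting two one-letter options, -e and -s.
parse() {
  OPTIND=1          # reset so the function can be called repeatedly
  seen=
  while getopts es opt "$@"; do
    seen="$seen -$opt"
  done
  printf '%s\n' "${seen# }"
}

parse -es      # bundled form
parse -e -s    # separate form; both calls print the same option list
```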

It SOUNDS like you are merely asking for a stable sort option.  Have you
tried the -s/--stable option?  That effectively adds an invisible key of
last resort that says if two lines otherwise compare equal, sort them so
that the line occurring first in input also occurs first in output.
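
As a minimal sketch of what -s does, assuming the key is bytes 5-10 of
each line (the AAAA/BBBB tags are invented here only to make the input
order visible; they are not from your data):

```shell
# Two lines share the key 300003; -s keeps them in input order
# (BBBB before AAAA) instead of falling back to a whole-line compare.
out=$(printf 'XXXX300003BBBB\nXXXX300001XXXX\nXXXX300003AAAA\n' \
  | LC_ALL=C sort -s -k1.5,1.10)
printf '%s\n' "$out"
```

Without -s, the last-resort comparison of the entire line would place
the AAAA line ahead of the BBBB line.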

At any rate, I'm marking this bug as 'needinfo' so that we can get more
feedback on whether --stable already meets your needs, or at least so we
can get a test case that we can play with to see what you are really
asking for.

Also, have you played with 'sort --debug'?  It shows you a lot more
details on EXACTLY what sort is looking at.  For example, I am able to
do a numeric sort on JUST the 6 digits in between the XXXX fillers of
the example you listed:

$ printf 'XXXX300002XXXX\nXXXX300001XXXX\n' \
   | LC_ALL=C sort --debug -k1.5,1.10n -s
sort: using simple byte comparison

> CONFIDENTIALITY:  This email (including any attachments) may contain 
> confidential,

Sorry, but this disclaimer is unenforceable on publicly archived lists.
It is considered poor netiquette to use your employer's email if they
insist on adding this on your behalf, and you may be better off sending
the mail from a personal account.

Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
