
Re: [coreutils] added ability in sort to skip n number of lines for each file


From: Jim Hester
Subject: Re: [coreutils] added ability in sort to skip n number of lines for each file
Date: Tue, 23 Nov 2010 10:57:46 -0500

Below is an updated, proper patch. It is quite a bit larger than my first, but it should address all of the concerns from Assaf and Pádraig.

My main motivation here is not just to make this common operation less annoying; it is mostly to improve performance.  I made a test dataset of 10 files with 3 header lines each and 500,000 lines to sort.  On an 8-core machine, I ran sort using head and tail as Pádraig suggests, and then again using my header-skip implementation.  In the timings below, wall-clock time drops from about 51.7s to 31.8s, while user CPU time is slightly higher.  Larger files seem to show a similar speedup.  I believe the speedup comes from the multithreaded sort trying to read from the buffer faster than tail can write to it.

>time { (head -q -n 3 test[0-9] | head -n 3; tail -q -n+4 test[0-9] | ./sort -n ) > out2; }

real    0m51.660s
user    2m0.324s
sys     0m4.115s

>time ./sort -n -l 3 test[0-9] > out

real    0m31.834s
user    2m17.775s
sys     0m3.981s
>diff out out2
>
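The test files themselves are not attached, so here is a minimal sketch for recreating a similar dataset and running the head/tail variant with a stock sort. The file contents, the `n_lines` and `workdir` names, and the small default size are assumptions for illustration, not the data behind the timings above. The `-l 3` option exists only in the patched ./sort, so it is left out of the sketch.

```shell
#!/bin/sh
# Hypothetical test data: 10 files, each with 3 header lines followed by
# n_lines random integers. Raise n_lines to approach the sizes discussed.
n_lines=1000
workdir=$(mktemp -d)

for i in 0 1 2 3 4 5 6 7 8 9; do
  {
    printf 'header 1\nheader 2\nheader 3\n'
    # Seed awk's generator per file so the files differ from each other.
    awk -v n="$n_lines" -v seed="$i" \
      'BEGIN { srand(seed); for (j = 0; j < n; j++) print int(rand() * 1000000) }'
  } > "$workdir/test$i"
done

# head/tail variant: keep one copy of the header, sort everything else.
{
  head -q -n 3 "$workdir"/test[0-9] | head -n 3
  tail -q -n +4 "$workdir"/test[0-9] | sort -n
} > "$workdir/out2"

wc -l "$workdir/out2"
```

Wrapping the pipeline in `time { ...; }` as in the transcript gives a baseline to compare against the patched binary.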

2010/11/22 Pádraig Brady <address@hidden>
On 22/11/10 22:21, Pádraig Brady wrote:
> Perhaps something like:
>
> (head --no-header -n1 file.* | head -n1; tail --no-header -n+2 file.* | sort)
>
> I.E. add the --no-header option to suppress the ==> file name <== annotations
> which would allow using `head` and `tail` in general for this.

Of course this being useful, it's already supported:

(head -q -n1 file.* | head -n1; tail -q -n+2 file.* | sort)

cheers,
Pádraig
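To illustrate the effect of -q on the quoted command, here is a small sketch with two throwaway files. The file names and contents are made up for the example, and `sort -n` is used because the sample data is numeric.

```shell
#!/bin/sh
# Two hypothetical input files that share a one-line header.
tmp=$(mktemp -d)
printf 'id\n3\n1\n' > "$tmp/file.a"
printf 'id\n2\n'    > "$tmp/file.b"

# Without -q, head prints a "==> name <==" banner before each file.
with_banners=$(head -n 1 "$tmp"/file.*)

# With -q, only the file contents are printed.
quiet=$(head -q -n 1 "$tmp"/file.*)

# One header line, followed by the sorted bodies of all files.
merged=$( (head -q -n 1 "$tmp"/file.* | head -n 1; tail -q -n +2 "$tmp"/file.* | sort -n) )
printf '%s\n' "$merged"
```

With these inputs the last command prints id, 1, 2 and 3, one per line.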

Attachment: sort_skip_lines_2.diff
Description: Binary data

