From: Assaf Gordon
Subject: Re: [coreutils] added ability in sort to skip n number of lines for each file
Date: Mon, 22 Nov 2010 15:20:07 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15) Gecko/20101030 Icedove/3.0.10
Hello Jim and all,
On 11/18/2010 11:36 AM, Jim Hester wrote:
> A common problem when sorting files stems from the file containing 1
> or more header lines, which should not be sorted.
I'm also very much interested in a "header-aware" sort operation.
However, I've found that "sort" requires a slightly more complicated
solution to keep the header in place reliably, because internally it
splits large inputs into temporary files and merges them later.
> I have made a simple patch to
> implement this feature, which I have attached to this email.
At the very least, I think that the following lines in your patch:
+ case 'l':
+ specify_nline_skip(oi,c,optarg);
+ break;
Should be changed to:
+ case 'l':
+ nline_skip = specify_nline_skip(oi,c,optarg);
+ break;
Otherwise the "nline_skip" variable stays at 0 and no lines are skipped.
However, your patch works only as long as all the sorting is done in
memory and never goes through the temporary files + merging flow.
Here's a demonstration of the problem:
1. Create a file containing numbers from 1-1M, three times, with a
header line.
$ (echo "42_header" ; seq 1 1000000 ; seq 1 1000000 ; seq 1 1000000) > input_with_header.txt
2. Sort with regular (unpatched) sort: all is well (obviously, the
header line is sorted as a number and does not appear as the first line):
$ sort -n input_with_header.txt | head -n 5
1
1
1
2
2
3. Sort with regular sort, limiting memory to 5M (forcing sort to use
temporary files): all is still well:
$ sort -S 5M -n input_with_header.txt | head -n 5
1
1
1
2
2
4. Sort with your patched sort, sorting done in-memory (because the file
is about 20MB and the default buffer is 500MB, IIRC) - all is well, the
header line is maintained as first line:
$ sort-header -l 1 -n input_with_header.txt | head -n 5
42_header
1
1
1
2
5. But sort with your patched sort, limiting memory to 5MB (forcing
temporary files + merging), and the output is incorrect (it should be
identical to the output in step 4):
42_header
1
2
3
4
----
I do not mean to discourage you, as I find the header sorting (and
joining) to be much needed. But I suspect a correct implementation will
be more complicated.
As a work-around, we're using a shell script that accepts most (not all)
of sort's options, "steals" the first couple of header lines, then passes
the rest of the input to sort.
Unlike Padraig's suggested solution, this script supports sorting from a
pipe/STDIN.
This is the script:
http://cancan.cshl.edu/labmembers/gordon/files/sort-header
It's far from complete, and any suggestions or comments about it are
welcome.
(It also assumes the input is tab-delimited, not white-space delimited,
which is fine for my purposes).
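
For readers who only want the idea, here is a minimal sketch of the
"steal the header, then sort the rest" approach. It is not the script
linked above: the function name sort_with_header and its calling
convention (header line count first, then ordinary sort options) are
made up for illustration. It relies on the shell's "read" builtin
consuming exactly one line at a time, so that the remaining input is
still available to sort, including when the input is a pipe.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils, not the linked script).
# usage: sort_with_header N [sort options...] < input
# Copies the first N lines of stdin to stdout unchanged, then hands
# the rest of stdin to the regular sort with the remaining options.
sort_with_header() {
    n=$1
    shift
    i=0
    # "read" stops at the newline, so nothing past the header is consumed.
    while [ "$i" -lt "$n" ] && IFS= read -r line; do
        printf '%s\n' "$line"
        i=$((i + 1))
    done
    # sort sees only the non-header lines; whether it sorts in memory
    # or through temporary files, the header has already been written.
    sort "$@"
}
```

It could be used on the test file from the demonstration above like this:

$ sort_with_header 1 -S 5M -n < input_with_header.txt | head -n 5

Because the header never reaches sort, the memory limit should not affect
where it ends up. The sketch takes no file name arguments, so several
input files would have to be handled one at a time.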
regards,
-gordon