bug-coreutils

Re: [PATCH] split: --chunks option


From: Chen Guo
Subject: Re: [PATCH] split: --chunks option
Date: Sat, 28 Nov 2009 11:38:07 -0800 (PST)

Hi Padraig,

> I do think --number is more general than --chunk as it allows you to
> specify only 1 number to get the behaviour described above. Also I
> notice that FreeBSDs split recently got a '-n chunk_count' option, so
> it would be good to maintain compat with that if possible.
> 
I read the FreeBSD source. It's interesting that Berkeley gave the
copyright to the UC Regents, who just skyrocketed my tuition. Anyhow...

More on topic, their --number option is actually quite trivial; they get
size = st_size/n and proceed like it's --bytes=size. In a sense, this
chunks option can be seen as an extension to their --number option.
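
To make the size = st_size/n arithmetic concrete, here is a minimal
Python sketch of the byte offsets it produces. This is not the FreeBSD
code: chunk_bounds is a hypothetical helper, and letting the last chunk
absorb the remainder bytes is an assumption of this sketch.

```python
def chunk_bounds(st_size, n):
    """Return (start, end) byte offsets for n chunks of an st_size-byte file."""
    if n <= 0:
        raise ValueError("n must be positive")
    size = st_size // n  # the size = st_size/n step described above
    bounds = []
    for k in range(n):
        start = k * size
        # Assumption: the final chunk takes whatever bytes are left over.
        end = st_size if k == n - 1 else start + size
        bounds.append((start, end))
    return bounds
```

For a 10-byte file and n = 3, this gives two 3-byte chunks followed by
one 4-byte chunk.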

I think what I'll end up doing is to implement their --number option,
outputting the chunks to files, and then extend it to support
--number=n/tot, which outputs a single chunk to stdout.

Then for delineation by newlines, I'll call it something like
--number-lines=n, outputting all chunks to files with split's cwrite.
What I have now would become --number-lines=n/tot, which extracts a
single chunk to stdout.
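
One way to picture the n/tot extraction with newline delineation is the
Python sketch below. It is an illustration only, not the patch:
line_chunk is a hypothetical name, and the rule that a line belongs to
the chunk containing its first byte is an assumption of this sketch.

```python
def line_chunk(data: bytes, k: int, tot: int) -> bytes:
    """Return chunk k (1-based) of tot, made up of whole lines only."""
    if not 1 <= k <= tot:
        raise ValueError("k must be in 1..tot")
    size = len(data) // tot
    start = (k - 1) * size
    end = len(data) if k == tot else k * size

    def align(pos: int) -> int:
        # Advance pos to the start of the next line, unless it is
        # already at a line start or at the end of the data.
        if pos == 0 or pos >= len(data) or data[pos - 1:pos] == b"\n":
            return min(pos, len(data))
        nl = data.find(b"\n", pos)
        return len(data) if nl == -1 else nl + 1

    # Assumption: a line belongs to the chunk holding its first byte.
    return data[align(start):align(end)]
```

Because each chunk ends exactly where the next one begins, concatenating
chunks 1..tot reproduces the input with no line split or duplicated.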

> We also need to decide how to select between text and binary modes for 
> --number.
> Note reading from non seekable input complicates things.
> For binary data I don't see how one could support --number.
> 

So under this scheme then it'd be up to the user whether to use --number or
--number-lines. --number of course supports binary, since it's byte
delineation rather than line delineation.

I also tested using this with sorting. As expected, it's not faster.
This was done on gcc 14; rand is a million-line ASCII file generated by
gensort. As I said, I'll try to implement the same concept internally
within sort, so we're free of the pipe overhead, and see how that goes.

address@hidden:~/testing$ time ./sortgl --threads=8 rand > /dev/null

real    0m1.820s
user    0m5.236s
sys    0m0.168s
address@hidden:~/testing$ time sort -m \
    <(./split -c1,8 rand | sort) <(./split -c2,8 rand | sort) \
    <(./split -c3,8 rand | sort) <(./split -c4,8 rand | sort) \
    <(./split -c5,8 rand | sort) <(./split -c6,8 rand | sort) \
    <(./split -c7,8 rand | sort) <(./split -c8,8 rand | sort) \
    > /dev/null

real    0m2.198s
user    0m5.324s
sys    0m0.440s
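
The pipeline above follows a sort-the-chunks-then-merge pattern. As a
rough in-process illustration of that pattern only, here is a Python
sketch. chunked_sort is a hypothetical helper; it runs sequentially, so
it says nothing about the timings, and it splits by line count rather
than by byte offset.

```python
import heapq

def chunked_sort(lines, tot=8):
    """Sort lines by sorting up to tot contiguous chunks and merging them."""
    size = -(-len(lines) // tot) or 1  # ceiling division, at least 1
    # Each sorted run plays the role of one "split ... | sort" process.
    runs = [sorted(lines[i:i + size]) for i in range(0, len(lines), size)]
    # heapq.merge plays the role of "sort -m" over the sorted runs.
    return list(heapq.merge(*runs))
```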

Finally, you guys probably won't hear back from me for a couple of weeks
on anything. It's the end of the quarter at UCLA, and that means fun
projects and even more fun finals.




