Fwd: Splitting STDIN to parallel processes (map-reduce on blocks of data)
From: Ole Tange
Subject: Fwd: Splitting STDIN to parallel processes (map-reduce on blocks of data)
Date: Wed, 12 Jan 2011 22:55:21 +0100
On Wed, Jan 12, 2011 at 9:23 PM, Cook, Malcolm <MEC@stowers.org> wrote:
> This will be hard to get right for all applications.
In that case I would rather make a simple engine to which you add
your own pre- and post-processors.
> A few considerations:
>
> You might borrow the semantics of the BSD `split` and `csplit` commands to
> determine block boundaries, allowing:
>
> parallel --block 'split -l 100'                 # 100-line blocks
>
> parallel --block 'csplit "%^>%" "/^>/" "{*}"'   # one fasta record per block
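[The `split -l` case is easy to try on its own with stock coreutils. A
minimal sketch; the `seq 10` input and the `part_` prefix are made-up
stand-ins, and no GNU Parallel is involved:

```shell
# Chop a stream into fixed-size line blocks, as --block 'split -l 100'
# would. seq 10 and the part_ prefix stand in for real input.
seq 10 | split -l 4 - part_
# split names the pieces part_aa, part_ab, part_ac, ...
wc -l < part_aa   # -> 4
wc -l < part_ac   # -> 2
```

The last block may be short, which any reduce step has to tolerate.]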
>
>
> The input file may have a 'header' (typically one line) that you want/need to
> repeat for each block.
This should be done by the pre-processor. It ought to be fairly easy
to do for this particular case:

cat hg18.fa |
  perl -pe 'if(1 .. 1) { $header = $_; } else { /^>/ and print "Newrecord$header" }' |
  parallel --record-sep Newrecord --delete-record-sep analyse |
  perl -ne 'if(1 .. 1) { $header = $_; print; } /^$header$/ or print;'
> The outputs from each block may have a header which typically/optionally
> should be removed from all but the first block result (the reduce is not
> simply `cat`).
See above.
> Since the process which splits input into blocks is highly application
> dependent, you might consider providing a --blocks parameter which specify a
> (tab?) delimited file holding line (byte?) offsets into the BigFile. This
> would allow creation of blocks using any method.
That looks to me like premature optimization. Why not just stream
BigFile and chop it up as we go?
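[Chopping while streaming needs no offset table at all. A sketch with
awk; the block size of 3 and the `blk_` prefix are invented:

```shell
# Read stdin once; start a new output file every n lines.
seq 7 | awk -v n=3 '
  (NR - 1) % n == 0 { if (f) close(f); f = sprintf("blk_%02d", (NR - 1) / n) }
  { print > f }'
# blk_00 holds lines 1-3, blk_01 lines 4-6, blk_02 line 7
```
]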
> I have been using the --colsep option of parallel to this effect (but
> without any consideration for remote machines), as follows:
That is very creative use of --colsep :-)
> # First, find the positions at which bigFile.tab should be broken into blocks.
> # blockRanges does this (as called, starting with the second line). It outputs a
> # tab-delimited file in the format "<block number> <start line> <end line>".
> blockRanges -start=2 -blockSize=1000 bigFile.tab > bigFile_block.tab
>
> # Now analyze these blocks of bigFile in parallel (giving all blocks a
> # common header from line 1), and throw away the headers of all but the
> # first output.
> cat bigFile_block.tab | parallel -k --colsep '\t' \
>   'perl -ne "if ((1 .. 1) || ({2} .. {3})) { print }" bigFile.tab |
>    analyzeMe | tail +$((1 == ${PARALLEL_SEQ} ? 1 : 2))' > bigFile.out
If your blockRanges program simply inserted a unique record marker and
piped it into parallel, then GNU Parallel could split on the marker
and optionally remove the marker.
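[The marker idea can be sketched without blockRanges itself. The
`===BLOCK===` marker, block size of 2, and `seq` input are invented
stand-ins:

```shell
# Emit a unique marker line at every block boundary instead of an
# offset table; a downstream tool can then split on the marker.
seq 6 | awk -v n=2 'NR > 1 && (NR - 1) % n == 0 { print "===BLOCK===" } { print }'
```

Since the marker travels in-band, the splitter and the consumer no
longer need to agree on anything except the marker string.]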
> This approach is working fine (on a 20-core machine) for me. Of course,
> analyzeMe has to work with stdin/stdout.
I would like to support situations where analyzeMe does not work with
stdin/stdout. Maybe something like:
... | parallel [...] analyzeMe --input-file {in} --output-file {out}
If {in} is found, GNU Parallel will substitute it with the name of a
tempfile containing the block.
If {out} is found, GNU Parallel will substitute it with the name of
an empty tempfile and will afterwards cat {out} to stdout; otherwise
the command's own stdout is passed through.
Afterwards {in} and {out} will be removed.
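[The proposed {in}/{out} life cycle can be emulated today with a small
wrapper. A sketch; the `run_with_files` name is invented, and `tr`
stands in for an analyzeMe that only accepts files:

```shell
# Write the incoming block to a tempfile ({in}), run a file-based tool,
# cat the result file ({out}) to stdout, then remove both tempfiles:
# exactly the life cycle described above.
run_with_files() {
  in=$(mktemp) || return 1
  out=$(mktemp) || return 1
  cat > "$in"                        # the block arrives on stdin -> {in}
  tr 'a-z' 'A-Z' < "$in" > "$out"    # stand-in for a file-only analyzeMe
  cat "$out"                         # {out} is streamed back to stdout
  rm -f "$in" "$out"                 # {in} and {out} are removed afterwards
}
printf 'abc\n' | run_with_files   # -> ABC
```
]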
Would this work for you? Ideas for improvements?
/Ole