From: Ole Tange
Subject: Fwd: Splitting STDIN to parallel processes (map-reduce on blocks of data)
Date: Wed, 12 Jan 2011 22:55:21 +0100

On Wed, Jan 12, 2011 at 9:23 PM, Cook, Malcolm <MEC@stowers.org> wrote:

> This will be hard to get right for all applications.

In that case I would rather make a simple engine to which you add
your own pre- and post-processors.

> A few considerations:
>
> You might borrow the semantics of the BSD `split` and `csplit` commands to 
> determine block boundaries, allowing:
>
>        parallel --block 'split -l 100' # 100 line blocks
>
>        parallel --block 'csplit "%^>%" "/^>/" "{*}"' # one fasta record per block
>
>
> The input file may have a 'header' (typically one line) that you want/need to 
> repeat for each block.

This should be done by the pre-processor. It ought to be fairly easy
to do for the record-separator case:

cat hg18.fa |
  perl -pe 'if (1 .. 1) { $header = $_ }
            else { /^>/ and print "Newrecord$header" }' |
  parallel --record-sep Newrecord --delete-record-sep analyse |
  perl -ne 'if (1 .. 1) { $header = $_; print }
            /^\Q$header\E$/ or print'
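
To see what the first perl stage does, run it on a toy input
(Newrecord standing in for whatever marker is guaranteed not to occur
in the data):

printf '#head\n>r1\nAC\n>r2\nGT\n' |
  perl -pe 'if (1 .. 1) { $header = $_ }
            else { /^>/ and print "Newrecord$header" }'
# prints:
#   #head
#   Newrecord#head
#   >r1
#   AC
#   Newrecord#head
#   >r2
#   GT

Each record now carries a copy of the header, so the blocks can be
processed independently.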

> The outputs from each block may have a header which typically/optionally 
> should be removed from all but the first block result (the reduce is not 
> simply `cat`).

See above.

> Since the process which splits input into blocks is highly application 
> dependent, you might consider providing a --blocks parameter which specifies a 
> (tab?) delimited file holding line (byte?) offsets into the BigFile.  This 
> would allow creation of blocks using any method.

That looks to me like premature optimization. Why not just stream
BigFile and chop it up as we go?
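
A minimal sketch of the streaming approach, using the stock split to
chop stdin into 1000-line chunk files as they arrive (analyse here is
a stand-in for whatever per-block command you run):

cat BigFile | split -l 1000 - chunk.
ls chunk.* | parallel analyse

This stages every chunk on disk; a splitter built into GNU Parallel
could hand blocks to the workers directly instead.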

>  I have been using the --colsep option of parallel to this effect (but 
> without accounting for remote machines), as follows:

That is very creative use of --colsep :-)

> # First, find the positions at which bigFile.tab should be broken into blocks.
> # blockRanges does this (as called, starting with the second line). It outputs
> # a tab-delimited file in the format "<block number> <start line> <end line>".
> blockRanges -start=2 -blockSize=1000 bigFile.tab > bigFile_block.tab
>
> # Now analyze these blocks of bigFile in parallel (giving all blocks a common
> # header from line 1), and throw away the headers of all but the first output.
> cat bigFile_block.tab |
>   parallel -k --colsep '\t' \
>     'perl -ne "if ((1 .. 1) || ({2} .. {3})) {print}" bigFile.tab |
>      analyzeMe | tail +$((1 == ${PARALLEL_SEQ} ? 1 : 2))' > bigFile.out

If your blockRanges program simply inserted a unique record marker
and piped the result into parallel, then GNU Parallel could split on
the marker and optionally remove it.
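
A marker-inserting variant could be as small as this (a sketch using
the --record-sep/--delete-record-sep options proposed above, and
ignoring the header handling for brevity; MARKER must not occur in
the data):

perl -ne 'print "MARKER\n" if $. > 1 && $. % 1000 == 1; print' bigFile.tab |
  parallel --record-sep MARKER --delete-record-sep analyzeMe > bigFile.out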

> This approach is working fine (on a 20-core machine) for me. Of course, 
> analyzeMe has to work with stdin/stdout.

I would like to support situations where analyzeMe does not work with
stdin/stdout. Maybe something like:

... | parallel [...] analyzeMe --input-file {in} --output-file {out}

If {in} is found, GNU Parallel will replace it with the name of a
tempfile containing the block.
If {out} is found, GNU Parallel will replace it with the name of an
empty tempfile and will cat that file when the job finishes;
otherwise the job's stdout will be passed through.
Afterwards {in} and {out} will be removed.
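
Per job that would amount to something like this plain-shell
equivalent (the tempfile handling is sketched with mktemp; analyzeMe
and its options are just the example from above):

in=$(mktemp); out=$(mktemp)
cat > "$in"                      # the job's block of stdin lands in {in}
analyzeMe --input-file "$in" --output-file "$out"
cat "$out"                       # {out} is delivered on the job's stdout
rm -f "$in" "$out"               # then {in} and {out} are removed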

Would this work for you? Ideas for improvements?


/Ole


