Fwd: Splitting STDIN to parallel processes (map-reduce on blocks of data)


From: Ole Tange
Subject: Fwd: Splitting STDIN to parallel processes (map-reduce on blocks of data)
Date: Wed, 12 Jan 2011 22:54:55 +0100

---------- Forwarded message ----------
From: Cook, Malcolm <MEC@stowers.org>
Date: Wed, Jan 12, 2011 at 9:23 PM
Subject: RE: Splitting STDIN to parallel processes (map-reduce on
blocks of data)
To: Ole Tange <tange@gnu.org>


This will be hard to get right for all applications.

A few considerations:

You might borrow the semantics of the BSD `split` and `csplit`
commands to determine block boundaries, allowing:

       parallel --block 'split -l 100' # 100 line blocks

       parallel --block 'csplit "%^>%" "/^>/" "{*}"' # one fasta record per block
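
For reference, the bare csplit idiom that second example borrows from could
be run directly like this (file name and output prefix are chosen purely for
illustration):

# Split sequences.fa into one file per fasta record.
# '%^>%' skips anything before the first '>' header without writing a file;
# '/^>/' '{*}' then starts a new piece at every subsequent '>' header.
csplit -f rec_ sequences.fa '%^>%' '/^>/' '{*}'
# Result: rec_00, rec_01, ... each holding one '>' header plus its sequence.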


The input file may have a 'header' (typically one line) that you
want/need to repeat for each block.
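
For example (file names and block size are only illustrative), one way to do
that by hand is:

# Set the header aside, split the body, then put the header back on each block.
head -n 1 bigFile.tab > header.tab
tail -n +2 bigFile.tab | split -l 1000 - body_
for f in body_*; do cat header.tab "$f" > "$f.hdr"; done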

The outputs from each block may have a header which
typically/optionally should be removed from all but the first block
result (the reduce is not simply `cat`).
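
In other words, the reduce step has to special-case the first block, along
these lines (the per-block result names here are assumed, not real):

# Concatenate per-block results, dropping the repeated header line from
# every block except the first.
# (assumes the result files sort in block order, e.g. zero-padded numbers)
first=1
for f in result_block_*; do
    if [ "$first" = 1 ]; then
        cat "$f"
        first=0
    else
        tail -n +2 "$f"
    fi
done > combined.out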

Since the process that splits input into blocks is highly
application-dependent, you might consider providing a --blocks parameter
which specifies a (tab?) delimited file holding line (byte?) offsets into
the BigFile.  This would allow blocks to be created by any method.  I have
been using the --colsep option of parallel to this effect (but without
taking remote machines into consideration), as follows:

# first, find the positions at which bigFile.tab should be broken into blocks.
# blockRanges does this (as called, starting with the second line).  It outputs a
# tab-delimited file in the format "<block number> <start line> <end line>".
blockRanges -start=2 -blockSize=1000 bigFile.tab > bigFile_block.tab
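
(blockRanges is a local script of mine; a rough awk stand-in, purely as a
sketch, would be something like:)

awk -v start=2 -v size=1000 'END {
    # emit "<block number>\t<start line>\t<end line>" for size-line blocks,
    # with the first block starting at line `start` (line 1 is the header)
    n = 0
    for (s = start; s <= NR; s += size) {
        e = s + size - 1; if (e > NR) e = NR
        printf "%d\t%d\t%d\n", ++n, s, e
    }
}' bigFile.tab > bigFile_block.tab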

# now, analyze these blocks of bigFile in parallel (giving all blocks a
# common header from line 1), and throw away the headers of all but the
# first output
cat bigFile_block.tab | parallel -k --colsep '\t' \
  'perl -ne "if ((1 .. 1) || ({2} .. {3})) {print}" bigFile.tab | analyzeMe | tail -n +$((1 == ${PARALLEL_SEQ} ? 1 : 2))' \
  > bigFile.out

This approach is working fine for me (on a 20-core machine).  Of course,
analyzeMe has to read from stdin and write to stdout.

Cheers,

Malcolm Cook


