
RE: feature suggestion: --preserve-blocking-factor


From: Cook, Malcolm
Subject: RE: feature suggestion: --preserve-blocking-factor
Date: Sat, 18 Feb 2017 04:39:14 +0000

Hi Ole,

I don't think my needs were clear.  

I know you are bioinformatics savvy and are familiar with bedtools, so let me 
cast my example in terms of bedtools.

I have a huge sorted bedfile, my.bed, that I want to pipe into bedtools merge 
(http://bedtools.readthedocs.io/en/latest/content/tools/merge.html)

As required, it is sorted already.

I could 

        cat my.bed | parallel -j10 --pipe --block 50M bedtools merge

but the blocks that parallel breaks my.bed into might not keep each 
chromosome's records together, and the merge is only correct if they do.
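
To make the failure concrete, here is a minimal sketch on made-up data. The `merge` function is only a hypothetical stand-in for `bedtools merge` on a sorted three-column file (it joins overlapping or book-ended intervals), so that the example runs without bedtools; the cut after line 1 plays the role of an unlucky block boundary.

```shell
# Hypothetical stand-in for `bedtools merge` on sorted 3-column BED input:
# joins intervals on the same chromosome that overlap or touch.
merge() {
  awk 'BEGIN { OFS = "\t" }
       NR > 1 && $1 == c && $2 <= e { if ($3 > e) e = $3; next }
       NR > 1 { print c, s, e }
       { c = $1; s = $2; e = $3 }
       END { if (NR) print c, s, e }'
}

# Made-up, already sorted input; the two chr1 intervals overlap.
bed=$(printf 'chr1\t100\t200\nchr1\t150\t300\nchr2\t50\t80\n')

# Merging the whole file in one go.
whole=$(printf '%s\n' "$bed" | merge)

# Merging two "blocks" separately, with the boundary inside chr1.
split_result=$( { printf '%s\n' "$bed" | head -n 1 | merge
                  printf '%s\n' "$bed" | tail -n 2 | merge; } )

printf 'whole file:\n%s\n\nsplit after line 1:\n%s\n' "$whole" "$split_result"
```

Merged as a whole, the two chr1 intervals collapse into one; merged block by block, they both survive, because neither block sees the other's interval.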

So I am looking for a means to instruct parallel that some ranges of records 
must stay together within a block.

Perhaps you have another suggestion for my situation using existing parallel 
capabilities...

Thanks,

Malcolm




 > -----Original Message-----
 > From: ole.tange@gmail.com [mailto:ole.tange@gmail.com] On Behalf Of
 > Ole Tange
 > Sent: Friday, February 17, 2017 3:01 PM
 > To: Cook, Malcolm <MEC@stowers.org>
 > Cc: parallel@gnu.org
 > Subject: Re: feature suggestion: --preserve-blocking-factor
 > 
 > On Thu, Feb 16, 2017 at 7:16 PM, Cook, Malcolm <MEC@stowers.org>
 > wrote:
 > 
 > > When using the --spreadstdin option, it may be desirable to ensure that
 > > the blocks "keep together" certain blocks of data.
 > 
 > Yes. We use --recend --recstart for that.
 > 
 > > For example the input may be sorted on column 3, and it may be the case
 > > that all lines having the same value for column 3 must be processed
 > > together.
 > 
 > So the record depends on column 3 having the same value.
 > 
 > Parsing a CSV file is expensive if it has to be done correctly (e.g.
 > values with tabs, quotes, and newlines). I do not see that becoming
 > part of GNU Parallel.
 > 
 > So how do you deal with the column issue?
 > 
 > Let us use this as an example:
 > 
 >   paste <(seq 105) <(parallel yes {}'|head -n {#}' ::: {a..n}) <(seq 105 | shuf) > example
 > 
 > We want to group this by column 2, so all consecutive lines with the
 > same column 2 will be treated as a single record and not be split.
 > However, it will be OK to join multiple records.
 > 
 > We will make a small program to insert a record separator. This has to
 > be a string not found in the file. Here I have chosen '\0' but it
 > could be "p-O-P-p-y i'M poPpY", $(mmencode /dev/urandom|head), or
 > $(mktemp).
 > 
 >   cat example | perl -ape '$F[1] ne $old and print "\0"; $old = $F[1]'
 > 
 > Now it is suddenly trivially simple to tell GNU Parallel to group the
 > records together and remove the record separator:
 > 
 >   parallel --recend '\0' --rrs --pipe --block 200 wc
 > 
 > We might need something for --pipepart, so you can feed in potential
 > split positions, but you would still have to write the program that
 > finds the positions yourself.
 > 
 > 
 > /Ole
