RE: Parallelising grep


From: Nathan S. Watson-Haigh
Subject: RE: Parallelising grep
Date: Mon, 12 Aug 2013 03:50:16 +0000

Hi Ole,

read.ids contains ~9 million lines (read IDs). The SAM/BAM file contains 
372,281,262 alignment lines.

Cheers,
Nathan


-----Original Message-----
From: ole.tange@gmail.com [mailto:ole.tange@gmail.com] On Behalf Of Ole Tange
Sent: Saturday, 10 August 2013 7:05 AM
To: Nathan S. Watson-Haigh
Cc: parallel@gnu.org
Subject: Re: Parallelising grep

On Fri, Aug 9, 2013 at 7:53 AM, Nathan S. Watson-Haigh 
<nathan.haigh@acpfg.com.au> wrote:
>
> I have a SAM/BAM file and I'd like to grep for alignments of certain 
> reads IDs. I have the read ID strings in another file. I'm currently 
> doing this
> with:
>
> $ samtools view in.bam | fgrep -w -f read.ids > alignments.txt

It would help to have some idea of the size of the BAM file and the ID list, 
so please post the output of:

$ samtools view in.bam | wc
$ wc read.ids
$ samtools view in.bam | fgrep -w -f read.ids | wc

Without that information, I would split the ids into one chunk per CPU:

$ parallel --round-robin --pipe --block 1k cat ">"id.{#} < read.ids

And then run one per CPU:

$ parallel "samtools view in.bam | fgrep -w -f {}" ::: id.* > alignments.txt
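
To illustrate what the two steps above do, here is a minimal sketch that 
uses only standard tools. It is not the GNU parallel commands themselves: 
awk and shell background jobs stand in for parallel, a small text file 
(alignments.sam) stands in for the output of samtools view, and the sample 
data, the chunk count, and the out.* file names are all made up for the 
example.

```shell
#!/bin/sh
# Sketch only: plain-text stand-ins for `samtools view in.bam` and read.ids.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data: 6 alignment lines, 3 wanted read IDs.
printf '%s\n' 'read1 0 chr1 100' 'read2 0 chr1 200' 'read3 16 chr2 50' \
              'read4 0 chr2 75' 'read5 16 chr3 10' 'read6 0 chr3 20' \
              > alignments.sam
printf '%s\n' read1 read4 read6 > read.ids

chunks=2   # assumption: stands in for "one chunk per CPU"

# Step 1: deal the IDs round-robin into id.1 .. id.N.
awk -v n="$chunks" '{ print > ("id." ((NR - 1) % n + 1)) }' read.ids

# Step 2: one fixed-string, whole-word grep per chunk, run concurrently.
for f in id.*; do
    grep -F -w -f "$f" alignments.sam > "out.$f" &
done
wait

# Collect the per-chunk results into one file.
cat out.id.* > alignments.txt
matched=$(wc -l < alignments.txt | tr -d ' ')
echo "matched lines: $matched"
```

Note that the matches come out grouped by chunk, so the line order in 
alignments.txt will generally differ from that of a single fgrep run over 
the whole ID list.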


/Ole

______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com 
______________________________________________________________________



