RE: Parallelising grep

Assuming your shell is bash....

With this exported function

function slice {
    # PURPOSE: After an optional <-h> lines of header (which are echoed
    # unless suppressed with <-sh>), echo every <-n>th line (default:
    # every line) starting with the <-m>th (counting from 1, starting
    # with the first line after the header; default: the <-n>th line).
    # AUTHOR: malcolm_cook@stowers.org
    # EXAMPLE: slice -h=1 -sh -n=5 foo.tab > foo_every_fifth_line_after_the_one_line_header.tab
    # set -e ;
    perl -snwe 'BEGIN{our $n||=1; our $m=($n) unless defined($m); $m-=1; our $h||=0; die "required: m < n" unless $m < $n; our $sh} print $_ if (($. > $h ) ? (($. -1 - $h) % $n == $m) : ! $sh)' -- "$@"
}

export -f slice
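
To see which lines each slice picks, here is a rough awk sketch of the same selection rule. Note that slice_awk and its positional arguments (header lines, stride, 1-based start, suppress-header flag) are hypothetical names made up for this illustration; it is not the function above and it has no defaults.

```shell
# slice_awk: hypothetical awk stand-in for the selection rule used by slice.
# Arguments (all required): header-lines stride start(1-based) suppress-header(0|1)
slice_awk() {
    awk -v h="$1" -v n="$2" -v m="$3" -v quiet="$4" '
        NR <= h { if (!quiet) print; next }   # header: echo unless suppressed
        (NR - 1 - h) % n == m - 1             # body: every n-th line from the m-th
    '
}

# Ten numbered lines split three ways: every line lands in exactly one slice.
for m in 1 2 3; do
    echo "slice $m: $(seq 1 10 | slice_awk 0 3 "$m" 0 | paste -sd ' ' -)"
done
# slice 1: 1 4 7 10
# slice 2: 2 5 8
# slice 3: 3 6 9
```

Taken together the slices cover the input exactly once, which is what lets each job grep its own share.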

...you can create parallel jobs where each job greps a slice of in.bam.

You would pass parallel's {#} as the value for -m, and the same value you pass as -j to parallel as the value for -n.

You'll probably need to use parallel's -q and have each job call bash.

The following is untested.

parallel -j 10 -q bash -c 'samtools view in.bam | slice -n=10 -m={#} | fgrep -w -f read.ids' ::: {1..10} > alignments.txt

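The output will not be in the original in.bam order: by default parallel prints each job's output as a block when that job finishes, so the slices come out one after another.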
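
If the original record order matters, one option is to tag each record with its line number before slicing and sort on that tag afterwards. The following is only a sketch on toy data: records.txt and ids.txt are made-up stand-ins for the output of samtools view in.bam and for read.ids, plain background jobs stand in for parallel, and an inline awk rule stands in for slice.

```shell
# Toy stand-ins (assumptions): records.txt for `samtools view in.bam`, ids.txt for read.ids.
workdir=$(mktemp -d)
printf 'r1 a\nr2 b\nr3 c\nr4 d\nr5 e\nr6 f\n' > "$workdir/records.txt"
printf 'r2\nr3\nr6\n' > "$workdir/ids.txt"

n=3
for m in 1 2 3; do
    # Prefix each record with its line number, keep every n-th line starting
    # at the m-th, then filter by ID. Each slice runs as a background job.
    awk -v n="$n" -v m="$m" '(NR - 1) % n == m - 1 { print NR "\t" $0 }' "$workdir/records.txt" \
        | grep -F -w -f "$workdir/ids.txt" > "$workdir/part.$m" &
done
wait

# Merge the parts back into input order and drop the line-number prefix.
sort -n "$workdir"/part.* | cut -f2- > "$workdir/alignments.txt"
cat "$workdir/alignments.txt"
```

One caveat: with -w the line-number prefix is itself a word, so purely numeric IDs could match it by accident. In that case the tag would need a form that cannot collide with an ID.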

From: parallel-bounces+mec=stowers.org@gnu.org [mailto:parallel-bounces+mec=stowers.org@gnu.org] On Behalf Of Nathan S. Watson-Haigh
Sent: Friday, August 09, 2013 12:54 AM
To: parallel@gnu.org
Subject: Parallelising grep

I have a SAM/BAM file and I'd like to grep for alignments of certain read IDs. I have the read ID strings in another file. I'm currently doing this with:

$ samtools view in.bam | fgrep -w -f read.ids > alignments.txt

Is it possible to parallelise the grep by having each grep process a different subset of read IDs from the read.ids file? Or is there an alternative way to parallelise this which I have overlooked?

Cheers,

Nathan

Nathan S. Watson-Haigh, PhD
Research Fellow in Bioinformatics
Australian Centre for Plant Functional Genomics (ACPFG)
School of Agriculture, Food and Wine
University of Adelaide Waite Campus
Plant Genomics Centre
Hartley Grove, Urrbrae
SA 5064
Phone: +61 8 8313 2046
Mobile: +61 438 711 615
Skype: nathanhaigh
Email: nathan.haigh@acpfg.com.au
Web: http://www.acpfg.com.au/bioinformatics
LinkedIn: http://www.linkedin.com/profile/view?id=114191748
Github: https://github.com/nathanhaigh/
https://gist.github.com/nathanhaigh/
Twitter: @watsonhaigh
@BIG_SA1
RID: B-9833-2008
ResearchGate: Nathan_Watson-Haigh


From:	Cook, Malcolm
Subject:	RE: Parallelising grep
Date:	Fri, 9 Aug 2013 16:26:50 +0000