
Re: Parallel with sed group capture


From: Matt Oates (Home)
Subject: Re: Parallel with sed group capture
Date: Wed, 8 May 2013 14:20:26 +0100

Dear Carlos,

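You just need to quote the whole sed command. parallel hands its command
to a second shell, and by then your single quotes are already gone, so
that shell strips the backslashes before sed ever sees them. So: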

cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'

becomes:

cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
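
To see why the extra quotes matter, here is a minimal sketch that uses
sh -c as a stand-in for the second shell that parallel runs the command
through. That stand-in is an assumption for illustration (parallel itself
is not invoked), and the variable names are made up:

```shell
header='@HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA'

# What the second shell sees once the original single quotes are gone:
# it turns \( into ( and \1 into 1, so sed gets a pattern with literal
# parentheses that never matches the header.
unquoted=$(printf '%s\n' "$header" | sh -c 'sed s#^\(@.*\)_\([12]\).*#\1/\2#')

# With the outer double quotes, the inner single quotes survive and the
# second shell passes the sed script through untouched.
quoted=$(printf '%s\n' "$header" | sh -c "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'")

echo "unquoted: $unquoted"   # header comes back unchanged
echo "quoted:   $quoted"     # @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289/1
```

The unquoted case reproduces the unchanged header from the original
report; the quoted case gives the renamed one.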


You might also want to tell parallel how the FASTQ records are
separated. Using @ as the record start is problematic if your reads come
from anything other than Illumina 1.5+, since in other encodings the
quality string can itself begin with an @. You could do something like
the following to split the pipe so that whole FASTQ records go to each job:

parallel --recstart='^@' --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"
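
A quick way to see the problem with splitting on @ is sketched below. The
two records are made-up data in which the second quality line happens to
start with @, and the variable names are invented for illustration:

```shell
# Two toy FASTQ records; the last line is a quality string starting with '@'.
fastq='@read1_1:N:0:ACTTGA
GCGA
+
GHHH
@read2_2:N:0:ACTTGA
TTAG
+
@HHG'

# Lines beginning with '@': the two headers plus the stray quality line.
at_lines=$(printf '%s\n' "$fastq" | grep -c '^@')

# True record count: every record is exactly four lines.
records=$(( $(printf '%s\n' "$fastq" | wc -l) / 4 ))

echo "lines starting with @: $at_lines"   # 3
echo "actual records:        $records"    # 2
```

Anything that treats a leading @ as a record boundary would cut the
second record in the middle here.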

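Note that --recstart is matched as a literal string unless --regexp is
also given, so check the pattern against the man page for your version.
Something like the following might be more appropriate though: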

parallel -N 4 --pipe "sed 's#^\(@.*\)_\([12]\).*#\1/\2#'"

This tells parallel to pass 4 lines at a time to each job; increase -N
in multiples of four to send more whole FASTQ records to each job. With
your current sed it doesn't actually matter whether each job gets whole
FASTQ records, since sed works line by line, but it might be important
in the future.
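
As a rough illustration of the four-lines-per-job idea, the sketch below
uses a plain while/read loop as a stand-in for parallel -N 4 --pipe. That
is an assumption for illustration only: the chunks run one after another,
not in parallel, and the data and variable names are made up. It checks
that feeding sed one record at a time gives the same result as feeding it
the whole input:

```shell
# Two toy FASTQ records (made-up data).
fastq='@read1_1:N:0:ACTTGA
GCGA
+
GHHH
@read2_2:N:0:ACTTGA
TTAG
+
HHHG'

rename='s#^\(@.*\)_\([12]\).*#\1/\2#'

# Whole input through a single sed.
whole=$(printf '%s\n' "$fastq" | sed "$rename")

# Same input, four lines (one record) at a time, each chunk to its own sed.
chunked=$(printf '%s\n' "$fastq" |
    while IFS= read -r l1 && IFS= read -r l2 &&
          IFS= read -r l3 && IFS= read -r l4; do
        printf '%s\n' "$l1" "$l2" "$l3" "$l4" | sed "$rename"
    done)

if [ "$whole" = "$chunked" ]; then
    echo "same output either way"
fi
```

A tool that needs to see a whole record at once would not be so
forgiving, which is where the chunk size starts to matter.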

Best Wishes,
Matt.

---
http://blog.mattoates.co.uk
http://www.mattoates.co.uk


On 8 May 2013 11:01, Carlos Pérez Cantalapiedra
<cpcantalapiedra@gmail.com> wrote:
> Hello everyone,
>
> I am new to this list and to the parallel command. I hope the answer to
> the following question is not too obvious, and that I can get some advice :)
>
> I have to process a big file, and have been reading about the parallel
> command to try to use more than one processor core when running sed, sort
> and so on. I first wanted to change the first line of every four (because
> of the naming conventions of this kind of file, the FASTQ format).
>
> For example, this would be a group of four, and I want to modify the first
> line:
>
>     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4
>
>     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>     GCGAGAGAAT
>     +
>     GHHHHHHHHHH
>
> The following command does the job:
>
>     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>
>     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289/1
>     GCGAGAGAAT
>     +
>     GHHHHHHHHHH
>
> However, when using parallel, it seems that the capture-group
> parentheses are not recognized:
>
>     cat sbcc073_pcm_ill_all.musket_default.fastq | head -4 | parallel --pipe sed 's#^\(@.*\)_\([12]\).*#\1/\2#'
>
>     @HWUSI-EAS1752R:29:FC64CL3AAXX:8:65:16525:4289_1:N:0:ACTTGA
>     GCGAGAGAAT
>     +
>     GHHHHHHHHHH
>
> When I remove the backslashes or use sed -r, the command reports:
>
>     /bin/bash: -c: line 3: syntax error near unexpected token `('
>     /bin/bash: -c: line 3: `             (cat /tmp/60xrxvCIRX.chr; rm /tmp/60xrxvCIRX.chr; cat - ) | (sed s#^(@.*)_([12]).*#\1/\2# );'
>
> Could anyone shed some light on this?
>
> Thank you very much


