Re: Help in parallelizing bedtools


From: Ole Tange
Subject: Re: Help in parallelizing bedtools
Date: Fri, 7 Mar 2014 02:12:33 +0100

On Sun, Mar 2, 2014 at 4:37 PM, Stefano Capomaccio <capemaster@gmail.com> wrote:

> I'm a happy user of parallel 20140122

Great to hear. If you like GNU Parallel:

* Walk through the tutorial
(http://www.gnu.org/software/parallel/parallel_tutorial.html)
* Give a demo at your local user group/team/colleagues
* Post the intro videos and tutorial on Reddit/Diaspora*/forums/blogs/
Identi.ca/Google+/Twitter/Facebook/Linkedin/mailing lists
* Request or write a review for your favourite blog or magazine
* Invite me for your next conference

If you use GNU Parallel for research:

* Please cite GNU Parallel in your publications (use --bibtex)

If GNU Parallel saves you money:

* (Have your company) donate to FSF https://my.fsf.org/donate/

> but I'm stuck on a problem with the semaphore option.

Semaphore is slower than normal parallel mode and seems to have a race
condition if you run 100s of jobs in parallel.

> In the following bash code my intent is to run on several cores (specified
> by $numcore) an R script.
>
> for file in `ls $directory`
> do
>   sem -j"$numcore" R < rscript.R --slave --args $file $other_input \
>     $directory > "$file".gw.log
> done
> sem --wait

The above should work. I can, however, not test it, as you have not
provided enough information. Please follow the section REPORTING BUGS
in the man page:

* A complete example that others can run that shows the problem. This
should preferably be small and simple. A combination of yes, seq, cat,
echo, and sleep can reproduce most errors. If your example requires
large files, see if you can make them by something like seq 1000000 >
file or yes | head -n 10000000 > file. If your example requires remote
execution, see if you can use localhost - maybe using another login.

* The output of your example. If your problem is not easily reproduced
by others, the output might help them figure out the problem.

* Whether you have watched the intro videos
(http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1), walked
through the tutorial (man parallel_tutorial), and read the EXAMPLE
section in the man page (man parallel - search for EXAMPLE:).

If you suspect the error is dependent on your environment or
distribution, please see if you can reproduce the error on one of
these VirtualBox images:
http://sourceforge.net/projects/virtualboximage/files/

In this case I think it is dependent on your environment, so please
make a reproducible example on a virtual machine.
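
A test case in the spirit of the list above could look like the sketch
below. It is only a stand-in and makes several assumptions: the R job is
replaced by sleep and echo, the directory layout, the file names and the
semaphore id "gw-test-$$" are made up for illustration, and the 32 files
and 10 job slots are taken from your description. If the final count is
lower than 32, the example reproduces the problem.

```shell
#!/bin/bash
# Hypothetical stand-in for the R job: each job sleeps, then writes a log.
workdir=$(mktemp -d)
mkdir "$workdir/in" "$workdir/log"
for i in $(seq 1 32); do
  echo "$i" > "$workdir/in/file$i"
done

numcore=10
semid="gw-test-$$"   # explicit semaphore id, so --wait refers to these jobs

if command -v sem >/dev/null 2>&1; then
  for file in "$workdir"/in/*; do
    name=$(basename "$file")
    sem --id "$semid" -j"$numcore" \
      "sleep 1; echo finished > '$workdir/log/$name.gw.log'"
  done
  # Should block until every job started above has completed.
  sem --id "$semid" --wait
else
  echo "sem (GNU Parallel) not found; nothing was run" >&2
fi

# Count the logs that exist at the moment 'sem --wait' has returned.
finished=$(find "$workdir/log" -name '*.gw.log' | wc -l | tr -d ' ')
echo "logs present after sem --wait: $finished"
# Remove "$workdir" by hand when you are done inspecting it.
```

Posting the script together with its output and the output of
'parallel --version' makes it possible for others to run it unchanged.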

> This task has to be done 32 times on 10 cores.
>
> I have noticed that parallel correctly spreads the jobs over the desired
> cores, but it seems that when the for loop exhausts the files (the thirty
> files) it does not wait until every job is done, and the following lines of
> code are executed, making you think that the analysis is done while some
> cores are still running.

With 'sem --wait' in place, that sounds like a bug.

> This is not convenient because I need the output of the 32 processes to be
> parsed after this step, and I miss two of them every time.
> Results are indeed correct but I cannot pipe this step.

A workaround:

ls $directory | parallel -j"$numcore" R '<' rscript.R --slave --args \
  {} $other_input $directory '>' {}.gw.log
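
To see the difference from the sem loop without needing R, the same
pattern can be tried with a stand-in job. This is a sketch under
assumptions: sleep and echo replace the R call, and the temporary
directories and file names are invented for the example. In normal mode
parallel itself does not return until all jobs have finished, so no
separate wait step is needed before the logs are parsed.

```shell
#!/bin/bash
# Hypothetical input: 32 small files standing in for the real data.
workdir=$(mktemp -d)
mkdir "$workdir/in" "$workdir/out"
for i in $(seq 1 32); do
  echo "$i" > "$workdir/in/file$i"
done

numcore=10

# Only run if the installed 'parallel' really is GNU Parallel.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  # {} is replaced by each input line (here: a file name).
  ls "$workdir/in" |
    parallel -j"$numcore" \
      "sleep 1; echo processed {} > $workdir/out/{}.gw.log"
else
  echo "GNU Parallel not found; nothing was run" >&2
fi

# parallel has returned, so every log it produced is already on disk.
produced=$(find "$workdir/out" -name '*.gw.log' | wc -l | tr -d ' ')
echo "logs present after parallel returned: $produced"
```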

Also you might find --results useful. And you might even take a look
at --shebang-wrap:

       R:       #!/usr/bin/parallel --shebang-wrap /usr/bin/Rscript --vanilla --slave


/Ole


