How should --onall work?


From: Ole Tange
Subject: How should --onall work?
Date: Thu, 26 May 2011 17:45:27 +0200

I have been convinced that GNU Parallel should have an --onall option.

       --onall (unimplemented)
                Run all the jobs on all computers given with --sshlogin. GNU
                parallel will log into --jobs number of computers in parallel
                and run one job at a time on each computer. The order of the
                jobs will not be changed, but some computers may finish
                before others.

I intend this:

  parallel --onall -S eos,iris '(echo {3} {2}) | awk \{print\ \$2}' ::: a b c ::: 1 2 3

to do:

  parallel -S eos '(echo {3} {2}) | awk \{print\ \$2}' ::: a b c ::: 1 2 3
  parallel -S iris '(echo {3} {2}) | awk \{print\ \$2}' ::: a b c ::: 1 2 3

In practice I believe this could be easily implemented by having GNU
Parallel call parallel like this:

  parallel -a /tmp/abc -a /tmp/123 -j1 -S eos '(echo {3} {2}) | awk \{print\ \$2}'
  parallel -a /tmp/abc -a /tmp/123 -j1 -S iris '(echo {3} {2}) | awk \{print\ \$2}'

where I simply put 'a\nb\nc\n' and '1\n2\n3\n' into /tmp/abc and
/tmp/123 respectively. Since the arguments are already being put into
temporary files, the change may be small. I believe this would work
out fine.
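
A minimal sketch of that wrapper idea, under stated assumptions: the
helper name onall_commands, the use of mktemp instead of fixed /tmp
names, and the simplified 'echo {2} {1}' command are all made up for
illustration. It only prints the per-host command lines instead of
running them, so it can be tried without any remote hosts.

```shell
#!/bin/sh
# Sketch: emulate the proposed --onall by emitting one parallel call per host.
# onall_commands is a hypothetical helper, not part of GNU Parallel.

# Put each argument list into its own temporary file, one argument per line.
abc=$(mktemp) && printf '%s\n' a b c > "$abc"
nums=$(mktemp) && printf '%s\n' 1 2 3 > "$nums"

onall_commands() {
    # $1 = comma-separated sshlogins, $2 = already-quoted command template
    hosts=$1
    template=$2
    old_ifs=$IFS
    IFS=,
    for host in $hosts; do
        # -j1 runs one job at a time on each host.
        printf 'parallel -a %s -a %s -j1 -S %s %s\n' \
            "$abc" "$nums" "$host" "$template"
    done
    IFS=$old_ifs
}

onall_commands eos,iris "'echo {2} {1}'"
```

Each printed line corresponds to one of the per-host calls shown
above; a real implementation would run them instead of printing them.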

A small penalty is that if you run n jobs in parallel and have 2n
hosts, it will do all the jobs for hosts 1 to n first and then all the
jobs for hosts n+1 to 2n. It will not run the first job on all hosts
first and then the second.
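
To make that ordering concrete, here is a small simulation. It is a
simplification that assumes every host takes equally long, so hosts
are served in strict rounds; the helper name schedule and the host
names are hypothetical.

```shell
#!/bin/sh
# Sketch: which hosts are served in which round when only n wrappers run at once.
# schedule is a hypothetical helper used only to illustrate the ordering.
schedule() {
    # $1 = n (wrappers running at once); remaining args = hosts
    n=$1
    shift
    round=1
    i=0
    for host in "$@"; do
        if [ "$i" -eq "$n" ]; then
            round=$((round + 1))
            i=0
        fi
        echo "round $round: all jobs on $host"
        i=$((i + 1))
    done
}

schedule 2 host1 host2 host3 host4
```

With n=2 and four hosts, host1 and host2 get all their jobs in round
1, and host3 and host4 only start in round 2.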

- o -

I have a harder time figuring out how to deal with stdin:

  cat | parallel --onall -S eos,iris

This should run whatever comes from cat on both eos and iris. While
the above is easy:

  cat | tee >(ssh eos) >(ssh iris) >/dev/null

it becomes harder if you have so many hosts (10000s) that you cannot
log in to all of them at the same time.

Also this one is tricky as you have to keep the {n} working:

  cat | parallel --onall -S eos,iris '(echo {3} {2}) | awk \{print\ \$2}' :::: - ::: a b c ::: 1 2 3

Maybe the solution is to accept that we have to read all of stdin
first, put that in a file and use -a as above?

So the tricky one will be executed like:

  # Stuff everything from stdin into a file
  cat > /tmp/stdin
  # Call parallel for each host in parallel
  parallel -a /tmp/stdin -a /tmp/abc -a /tmp/123 -j1 -S eos '(echo {3} {2}) | awk \{print\ \$2}' &
  parallel -a /tmp/stdin -a /tmp/abc -a /tmp/123 -j1 -S iris '(echo {3} {2}) | awk \{print\ \$2}' &

The price will be that if you have a slow program generating the stdin
then that program has to finish before GNU Parallel can even begin
executing the jobs. Ideally GNU Parallel should start executing the
jobs that it already knows have to be run.

One way of solving that would be having a jobqueue for each sshlogin.
That, however, looks like a big change to the code.
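
Outside of GNU Parallel, the per-sshlogin queue idea can be sketched
with one named pipe per host. This is only an illustration, not a
proposal for the actual code: run_on_host is a local stand-in for
"ssh HOST" that just records what it would run, and the queue and log
file names are made up.

```shell
#!/bin/sh
# Sketch: one queue (named pipe) per sshlogin, fed while stdin is still arriving.
# run_on_host is a local stand-in for "ssh HOST"; it only records what it would run.
dir=$(mktemp -d)

run_on_host() {
    # $1 = host; reads jobs from that host's queue, one per line
    while IFS= read -r job; do
        echo "$1: $job" >> "$dir/$1.log"
    done < "$dir/$1.q"
}

for host in eos iris; do
    mkfifo "$dir/$host.q"
    run_on_host "$host" &
done

# The producer may be slow; tee hands its input to every queue as it reads it.
printf '%s\n' 'echo a' 'echo b' | tee "$dir/eos.q" "$dir/iris.q" > /dev/null
wait
```

Every host sees the jobs in the original order and can start before
the producer has finished. One limitation of this sketch is that tee
blocks when a pipe is full, so the slowest host throttles all the
others.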

- o -

People wanting to use GNU Parallel for running the same commands on a
list of hosts: can you please describe your situations, so the design
will work well? At the very least I need to know:

* number of hosts (can we just log in to all of them simultaneously?)
* number of commands to be run (is it just 1 or is it a script
generated on stdin?)
* is speed an issue? (would it be OK to ssh for each command?)
* how are the commands generated? (is it a fast program, so it is OK
to read everything before executing anything?)


/Ole


