Re: fault tolerance, retry task on different node, recovery orientation?
From: Ole Tange
Subject: Re: fault tolerance, retry task on different node, recovery orientation?
Date: Fri, 30 May 2014 00:17:04 +0200

As mentioned in the man page: Computers will only be reused if the
number of retries > number of computers (or more correctly:
sshlogins).
The order in which the computers are tried is the order in which they
are extracted from a Perl hash using 'values', which is effectively
arbitrary. I am still puzzled why you believe this order is important.
I believe it is much more important to know that a computer on which
the job has failed will not be chosen again unless number of retries >
number of sshlogins.
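
To make the rule concrete, here is a rough Python sketch of the behaviour
described above. It is not GNU parallel's actual code (which is Perl); the
names pick_host and run_with_retries are made up for illustration, and it
assumes that "retries" counts total attempts and that the list of failed
hosts is reset once every sshlogin has been tried.

```python
def pick_host(hosts, failed_on):
    """Pick a host for the next attempt of one job.

    hosts:     sshlogins in whatever order they happen to be iterated
               (arbitrary, like the output of Perl's 'values').
    failed_on: set of sshlogins on which this job has already failed.
    """
    for host in hosts:
        if host not in failed_on:
            return host
    # Assumption: every sshlogin has failed this job, so hosts are reused
    # and the bookkeeping starts over.
    failed_on.clear()
    return hosts[0]


def run_with_retries(hosts, retries, job_succeeds_on):
    """Return the list of hosts tried, in order, for a single job."""
    failed_on = set()
    attempts = []
    for _ in range(retries):  # assumption: retries == total attempts
        host = pick_host(hosts, failed_on)
        attempts.append(host)
        if job_succeeds_on(host):
            break
        failed_on.add(host)
    return attempts


if __name__ == "__main__":
    always_fail = lambda host: False
    # 3 attempts on 3 sshlogins: no host is used twice.
    print(run_with_retries(["a", "b", "c"], 3, always_fail))
    # 5 attempts on 3 sshlogins: reuse starts only after all 3 have failed.
    print(run_with_retries(["a", "b", "c"], 5, always_fail))
```

Whatever order the hosts come out in, the property that matters is the
same: a host on which the job has failed is skipped until all the others
have been tried.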
/Ole
On Thu, May 29, 2014 at 9:27 PM, Mitchell Wyle <mfw@wyle.org> wrote:
> Hi Ole,
>
> Thanks for the quick reply. I meant: if I have 10 SSHLOGIN computers, how
> does parallel choose which one it will dispatch the next job to, and which
> one it will dispatch a failed job to when retrying it? That is, what
> selection method does it use when it does what the man page says:
> "retry it on another computer"? Round-robin is better than random
> (zookeeper) and better than "least loaded."
>
> Thanks again.
>
>
>
>
> On Thu, May 29, 2014 at 12:20 PM, Ole Tange <ole@tange.dk> wrote:
>>
>> On Thu, May 29, 2014 at 8:54 PM, Mitchell Wyle <mfw@wyle.org> wrote:
>> > Cool! I shall try simple --retries and verify it works. Does it
>> > "round robin" the tries? Thanks!
>>
>> No. It does what it says in the man page:
>>
>>     --retries n
>>         If a job fails, retry it on another computer. Do
>>         this n times. If there are fewer than n computers
>>         in --sshlogin GNU parallel will re-use the
>>         computers. This is useful if some jobs fail for no
>>         apparent reason (such as network failure).
>>
>> Why do you think it would do something else than what it says in the man
>> page?
>>
>>
>> /Ole
>
>