parallel
Re: Dynamically changing remote servers list


From: Douglas A. Augusto
Subject: Re: Dynamically changing remote servers list
Date: Fri, 15 Aug 2014 21:59:39 -0300
User-agent: Mutt/1.5.23 (2014-03-12)

On 14/08/2014 at 09:30,
Ole Tange <ole@tange.dk> wrote:

> --sshloginfile already takes a file, so it will be natural to re-read
> that. Probably using this method:
> 
>     When a job finishes, and it is more than 1 second since we checked last:
>       Check if the file has changed modification time. If yes: re-read it.

Dear Ole,

Thanks for your reply. That seems to be a nice solution.
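
To check that I understand the proposal, here is a rough sketch of the re-read
logic. It is written in Python purely for illustration (GNU Parallel itself is
Perl), and every name in it (SshLoginFile, on_job_finished, min_interval) is
hypothetical rather than anything that exists in the code base. It also assumes
one host per line, with blank lines and lines starting with '#' skipped.

```python
import os
import time


class SshLoginFile:
    """Hypothetical helper: re-read a host list when its mtime changes."""

    def __init__(self, path, min_interval=1.0, clock=time.monotonic):
        self.path = path
        self.min_interval = min_interval  # seconds between mtime checks
        self.clock = clock                # injectable so it can be tested
        self.last_check = float("-inf")
        self.last_mtime = None
        self.hosts = []

    def _read(self):
        # Assumption: one host per line; blank lines and '#' lines are skipped.
        with open(self.path) as fh:
            stripped = (line.strip() for line in fh)
            return [s for s in stripped if s and not s.startswith("#")]

    def on_job_finished(self):
        """Call when a job finishes. Returns True if the list was re-read."""
        now = self.clock()
        # Only look at the file if more than min_interval has passed.
        if now - self.last_check <= self.min_interval:
            return False
        self.last_check = now
        mtime = os.stat(self.path).st_mtime_ns
        if mtime == self.last_mtime:
            return False  # file unchanged since the last read
        self.last_mtime = mtime
        self.hosts = self._read()
        return True
```

The stat call is cheap, and the one-second guard keeps it from being issued for
every job when many short jobs finish in quick succession.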

> Removing is, however, a completely different ballgame: What do you do
> about jobs currently running on the servers? Also there is no
> infrastructure to say: Don't start new jobs on this server and remove
> it when the last job completes. The easiest is probably to add a 'dont
> start new jobs' flag to the server-object, and leave the data
> structure in place. It will, however, cost a few cycles to skip the
> server every time a new job is started.
> 
> --filter-hosts does the "removal" by not adding the host in the first place.

In order to make GNU Parallel more resilient, particularly when running jobs on
remote servers over unreliable Internet connections, I think it should be able
to detect when a server is "down" and when it is "back" again. This would act
like a dynamic "--filter-hosts". The likelihood of at least one server being
inaccessible (temporarily or permanently) grows quickly with the number of
servers and the total processing time, so in my opinion this kind of feature
would make GNU Parallel more robust and scalable. Currently, when a server goes
off-line, GNU Parallel does not notice and keeps trying to schedule jobs on it,
all of which fail.
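
One possible shape for such down/back detection is sketched below, again in
Python only as an illustration. The class HostHealth, the thresholds
max_failures and retry_after, and the probe callback (imagined as something
like a quick ssh connectivity check) are all assumptions of mine, not existing
GNU Parallel features.

```python
import time


class HostHealth:
    """Hypothetical tracker: mark hosts down after repeated failures,
    and probe them again once a cooldown has elapsed."""

    def __init__(self, probe, max_failures=3, retry_after=60.0,
                 clock=time.monotonic):
        self.probe = probe                # callable(host) -> bool, assumed
        self.max_failures = max_failures  # consecutive failures before "down"
        self.retry_after = retry_after    # seconds to wait before re-probing
        self.clock = clock
        self.failures = {}                # host -> consecutive failure count
        self.down_since = {}              # host -> time it was marked down

    def record_result(self, host, ok):
        """Record whether a job (or probe) on host succeeded."""
        if ok:
            self.failures[host] = 0
            self.down_since.pop(host, None)
            return
        self.failures[host] = self.failures.get(host, 0) + 1
        if (self.failures[host] >= self.max_failures
                and host not in self.down_since):
            self.down_since[host] = self.clock()

    def usable(self, host):
        """Return True if new jobs may be scheduled on host."""
        since = self.down_since.get(host)
        if since is None:
            return True
        now = self.clock()
        if now - since < self.retry_after:
            return False  # still cooling down, skip without probing
        if self.probe(host):
            self.record_result(host, True)  # host is "back"
            return True
        self.down_since[host] = now  # still down, restart the cooldown
        return False
```

The scheduler would call usable() before assigning a job and record_result()
after each job, so a dead server costs at most one probe per cooldown period
instead of one failed job per scheduling attempt.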

With respect to what to do with jobs currently running on the servers, I think
GNU Parallel should simply wait until they complete. If the user really wants
to kill them, they could do this manually (perhaps even using a second
instance of GNU Parallel to issue a "broadcast kill"). Alternatively, this
decision could be left to the user via a user-defined parameter.
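
Combining the "don't start new jobs" flag you suggested with waiting for
running jobs to complete, removal could look roughly like the sketch below.
It is Python for illustration only, and Server, Pool, and the draining flag are
hypothetical names, not GNU Parallel's actual data structures. Picking the
least-loaded server is likewise just a placeholder scheduling policy.

```python
class Server:
    """Hypothetical server object with a 'don't start new jobs' flag."""

    def __init__(self, name):
        self.name = name
        self.running = 0       # number of jobs currently running
        self.draining = False  # True: finish current jobs, start no new ones


class Pool:
    def __init__(self, names):
        self.servers = {name: Server(name) for name in names}

    def start_job(self):
        """Pick the least-loaded non-draining server, or None if none."""
        candidates = [s for s in self.servers.values() if not s.draining]
        if not candidates:
            return None
        server = min(candidates, key=lambda s: s.running)
        server.running += 1
        return server.name

    def finish_job(self, name):
        """Account for a finished job; drop the server if it has drained."""
        server = self.servers[name]
        server.running -= 1
        if server.draining and server.running == 0:
            del self.servers[name]

    def remove(self, name):
        """Request removal: immediate if idle, otherwise after the last job."""
        server = self.servers[name]
        server.draining = True
        if server.running == 0:
            del self.servers[name]
```

Skipping draining servers does cost a pass over the list for each new job, as
you point out, but the server is dropped as soon as its last job finishes, so
that cost lasts only while jobs are still running there.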


Best regards,

-- 
Douglas A. Augusto


