Re: Parallel Digest, Vol 52, Issue 10


From: Mitchell Wyle
Subject: Re: Parallel Digest, Vol 52, Issue 10
Date: Sat, 16 Aug 2014 09:12:30 -0700

Here are some other ideas to consider for flexibly and dynamically adding / removing servers:

Consider implementing what Hadoop calls "speculative execution," where you send the same job to two or more servers and the first to complete the job wins.
Consider using aggressive timeouts for each job -- keep the jobs small and schedule very many of them to run; don't wait long for an individual one to be considered a failure.
Consider "heartbeats" of some kind, where the parallel instances on remote servers report to the parallel that dispatches jobs that they are available.
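
To make the first two ideas concrete, here is a minimal Python sketch, not anything GNU Parallel does today. It assumes a caller-supplied function `run_on(server, job)` (a hypothetical name) that runs one job on one server and returns its output or raises on failure. The same job is sent to every listed server, the first copy to succeed wins, and a short overall timeout turns a slow job into a failure.

```python
import concurrent.futures as cf


def run_speculatively(job, servers, run_on, timeout):
    """Run `job` on every server in `servers`; return (server, result) of the
    first copy that finishes without raising.

    `run_on(server, job)` is a hypothetical caller-supplied function.
    Raises concurrent.futures.TimeoutError if no copy finishes in `timeout`
    seconds, and RuntimeError if every copy fails.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(run_on, server, job): server for server in servers}
    try:
        # as_completed yields futures in the order in which they finish.
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return futures[fut], fut.result()
        raise RuntimeError("all copies of the job failed")
    finally:
        # Do not wait for the losers; their results are simply ignored.
        pool.shutdown(wait=False)
```

Note that the losing copies are not killed here, only ignored, so this works best with the small, cheap jobs suggested above.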


On Sat, Aug 16, 2014 at 9:00 AM, <parallel-request@gnu.org> wrote:

Today's Topics:

   1. Re: Dynamically changing remote servers list (Douglas A. Augusto)
   2. Re: Dynamically changing remote servers list (Ole Tange)
   3. Re: Dynamically changing remote servers list (Achim Gratz)


----------------------------------------------------------------------

Message: 1
Date: Fri, 15 Aug 2014 21:59:39 -0300
From: "Douglas A. Augusto" <daaugusto@gmail.com>
To: "parallel@gnu.org" <parallel@gnu.org>
Subject: Re: Dynamically changing remote servers list
Message-ID: <20140816005938.GA2845@phenom>
Content-Type: text/plain; charset=utf-8

On 14/08/2014 at 09:30,
Ole Tange <ole@tange.dk> wrote:

> --sshloginfile already takes a file, so it will be natural to re-read
> that. Probably using this method:
>
>     When a job finishes, and it is more than 1 second since we checked last:
>       Check if the file has changed modification time. If yes: re-read it.
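
The re-read rule quoted above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the class name `HostFileWatcher`, the method name `hosts_after_job`, and the simplified parsing (one sshlogin per line, blank lines and lines starting with `#` skipped) are mine, not GNU Parallel's actual code.

```python
import os
import time


class HostFileWatcher:
    """Re-read a host list file when its modification time changes,
    checking at most once per `min_interval` seconds."""

    def __init__(self, path, min_interval=1.0, clock=time.monotonic):
        self.path = path
        self.min_interval = min_interval
        self.clock = clock          # injectable so the logic can be tested
        self.last_check = None      # time of the last mtime check
        self.last_mtime = None      # mtime seen at the last read
        self.hosts = []

    def hosts_after_job(self):
        """Call this whenever a job finishes; returns the current host list."""
        now = self.clock()
        if self.last_check is not None and now - self.last_check <= self.min_interval:
            return self.hosts       # checked less than a second ago: skip
        self.last_check = now
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self.last_mtime:
            self.last_mtime = mtime
            with open(self.path) as fh:
                self.hosts = [
                    line.strip()
                    for line in fh
                    if line.strip() and not line.lstrip().startswith("#")
                ]
        return self.hosts
```

Because the check only happens when a job finishes, an idle dispatcher never touches the file, and a busy one stats it at most once per second.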

Dear Ole,

Thanks for your reply. That seems to be a nice solution.

> Removing is, however, a completely different ballgame: What do you do
> about jobs currently running on the servers? Also there is no
> infrastructure to say: Don't start new jobs on this server and remove
> it when the last job completes. The easiest is probably to add a 'dont
> start new jobs' flag to the server-object, and leave the data
> structure in place. It will, however, cost a few cycles to skip the
> server every time a new job is started.
>
> --filter-hosts does the "removal" by not adding the host in the first place.

In order to make GNU Parallel more resilient, particularly when running jobs on
remote servers over unreliable internet connections, I think it should be able
to detect when a server is "down" and when it is "back" again. This would be
like a dynamic "--filter-hosts". The likelihood of at least one server being
inaccessible (temporarily or definitively) increases quickly with the size of
the list of servers and processing time; in my opinion, having this kind of
feature would make GNU Parallel more robust and scalable. Currently, when a
server goes off-line GNU Parallel does not recognize this and tries over and
over to schedule jobs on it (and they all fail).

With respect to what to do with jobs currently running on the servers, I think
GNU Parallel should simply wait until they complete. If the user really wants
to kill them, he or she could do this manually (perhaps even using a second
instance of GNU Parallel to issue a "broadcast kill"). Alternatively this
decision could be left to the user via a user-defined parameter.


Best regards,

--
Douglas A. Augusto



------------------------------

Message: 2
Date: Sat, 16 Aug 2014 11:31:33 +0200
From: Ole Tange <ole@tange.dk>
To: "Douglas A. Augusto" <daaugusto@gmail.com>
Cc: "parallel@gnu.org" <parallel@gnu.org>
Subject: Re: Dynamically changing remote servers list
Message-ID:
        <CA+4vN7yx-9dU1BHPY8gZb=c8kEdgheOEm2ET1z1eAiFgh0AJ9A@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

On Sat, Aug 16, 2014 at 2:59 AM, Douglas A. Augusto <daaugusto@gmail.com> wrote:
> On 14/08/2014 at 09:30,
> Ole Tange <ole@tange.dk> wrote:
>
>> Removing is, however, a completely different ballgame: What do you do
>> about jobs currently running on the servers? Also there is no
>> infrastructure to say: Don't start new jobs on this server and remove
>> it when the last job completes. The easiest is probably to add a 'dont
>> start new jobs' flag to the server-object, and leave the data
>> structure in place. It will, however, cost a few cycles to skip the
>> server every time a new job is started.
>>
>> --filter-hosts does the "removal" by not adding the host in the first place.
>
> In order to make GNU Parallel more resilient, particularly when running jobs on
> remote servers over unreliable internet connections, I think it should be able
> to detect when a server is "down" and when it is "back" again. This would be
> like a dynamic "--filter-hosts".

I have been thinking along the same lines, but have been unable to
find an easy way of doing that in practice.

Here are some of the problems:

There is only one valid test to see if a machine is up and that is by
using ssh to run a command (such as /bin/true). We cannot assume that
the hostname given is known to DNS: I have several host aliases that
are only defined in my .ssh/config and which are behind a firewall
that you first have to log into. SSH works fine, but ping would fail
miserably.

GNU Parallel is (as crazy as it sounds) mostly serial. So if we need
to run a test before starting a new job, then all other jobs will be
delayed.

Should we test if a server is down before spawning a new job?
If the jobs are long, then the added time for the test might not be
too bad. But if the jobs are short then this will add considerable
time to running the job. And if the server is dead, we will have to
wait for a timeout - delaying jobs even further.

We can assume that the server is up, if a job completes without error.
If there is an error, we cannot tell whether the job returned an error
or if ssh failed (i.e. the server is down). But if jobs fail it is
clearly an indication that the server could be down. So the test could
be done here: Check if the server is down, if a job fails. That will
delay a short time if the server is up, and delay a longer time if the
server is down.

Now let us assume server1 is down and removed. How do we add it back?
When should we retry if server1 is up? A failed try is expensive as
that delays everything. (A simple test of ssh indicates it takes 120
seconds to time out.) A way to mitigate this could be to time out
earlier. We know how long it used to take to log in to the host, so we
could use a timeout that is 10 times the original time.
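
A rough Python sketch of such a probe, to be run only after a job has failed on the host. It assumes the login time was measured earlier and is passed in as `login_seconds`; the function names and the `run` parameter are hypothetical. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options, but `ConnectTimeout` only bounds connection setup, so the whole probe is also bounded by a subprocess timeout.

```python
import math
import subprocess


def probe_command(host, login_seconds, factor=10):
    """Build an ssh probe whose timeout is `factor` times the known login time."""
    timeout = max(1, math.ceil(login_seconds * factor))
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",                 # never prompt for a password
        "-o", f"ConnectTimeout={timeout}",     # bound the connection setup
        host,
        "true",                                # trivial remote command
    ]
    return cmd, timeout


def host_is_up(host, login_seconds, run=subprocess.run):
    """Return True if the probe command succeeds on `host` within the timeout."""
    cmd, timeout = probe_command(host, login_seconds)
    try:
        done = run(cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False                           # treat a hang as "down"
    return done.returncode == 0
```

Going through ssh rather than ping means aliases defined only in .ssh/config keep working, which is the point made earlier in this message.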

Another way would be to add server1 back after some timeout and let it
be kicked again if a job fails on it again - doubling the timeout
before it can be considered again.
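
As a sketch of this second approach in Python (the class name `HostBackoff`, the default of 60 seconds, and the upper cap `max_delay` are my assumptions, not part of the proposal): a host that fails is banned for a while, is allowed back automatically, and has its ban doubled each time it fails again.

```python
class HostBackoff:
    """Track when a failed host may be tried again, doubling the wait
    after each repeated failure."""

    def __init__(self, initial_delay=60.0, max_delay=3600.0):
        self.initial_delay = initial_delay
        self.max_delay = max_delay   # assumed cap so the ban cannot grow forever
        self.delay = {}              # host -> length of the current ban
        self.retry_at = {}           # host -> time at which it may be used again

    def job_failed(self, host, now):
        previous = self.delay.get(host)
        if previous is None:
            delay = self.initial_delay
        else:
            delay = min(previous * 2, self.max_delay)
        self.delay[host] = delay
        self.retry_at[host] = now + delay

    def job_succeeded(self, host):
        # A success resets the host to a clean state.
        self.delay.pop(host, None)
        self.retry_at.pop(host, None)

    def usable(self, host, now):
        return now >= self.retry_at.get(host, now)
```

The scheduler would call `usable(host, now)` before starting a job, so no separate probe is needed: the next real job on the host acts as the test.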

All in all doable, but it does not seem trivially simple.

> With respect to what to do with jobs currently running on the servers, I think
> GNU Parallel should simply wait until they complete. If the user really wants
> to kill them, he or she could do this manually (perhaps even using a second
> instance of GNU Parallel to issue a "broadcast kill"). Alternatively this
> decision could be left to the user via a user-defined parameter.

Having given this a bit more thought, it should be possible to set the
number of jobslots on a host to 0.

That would have the effect that no new jobs would be spawned.
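
A small Python sketch of how that could look in a scheduler. The `Host` record and the function names are hypothetical and do not reflect GNU Parallel's internal data structures: a host with 0 jobslots is never picked for new work, its running jobs are left alone, and it can be dropped once they have finished.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    slots: int          # number of jobslots; 0 means "start nothing new here"
    running: int = 0    # jobs currently running on this host


def pick_host(hosts):
    """Return the host with the most free jobslots, or None if none is free."""
    free = [h for h in hosts if h.running < h.slots]
    return max(free, key=lambda h: h.slots - h.running, default=None)


def removable(host):
    """A drained host can be removed once its last running job completes."""
    return host.slots == 0 and host.running == 0
```

Setting `slots = 0` therefore covers both "stop starting jobs here" and "remove it when the last job completes" without a separate flag.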


/Ole



------------------------------

Message: 3
Date: Sat, 16 Aug 2014 11:51:28 +0200
From: Achim Gratz <Stromeko@nexgo.de>
To: parallel@gnu.org
Subject: Re: Dynamically changing remote servers list
Message-ID: <87bnrkep9b.fsf@Rainer.invalid>
Content-Type: text/plain; charset=utf-8

Ole Tange writes:
> I have been thinking along the same lines, but have been unable to
> find an easy way of doing that in practice.

While hard to solve in general, some of this is quite easy if you can
make a few assumptions.  Most server or VM farms can be scaled up or
down based on overall load, so I think these would be prime candidates
for this "servers come and go" scenario -- and they all have in-band and
sometimes out-of-band monitoring that you can tap into in various ways.

[...]
> All in all doable, but it does not seem trivially simple.
[...]

Just make it the responsibility of the user that each server in the list
given to parallel is actually reachable, don't second-guess the user.
That list may actually be something that the user just gets from
somewhere else, so you should perhaps be flexible with the expected
format.


Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+

Wavetables for the Terratec KOMPLEXER:
http://Synth.Stromeko.net/Downloads.html#KomplexerWaves




------------------------------

_______________________________________________
Parallel mailing list
Parallel@gnu.org
https://lists.gnu.org/mailman/listinfo/parallel


End of Parallel Digest, Vol 52, Issue 10
****************************************

