gwl-devel

Re: Getting started with GWL 0.3.0


From: Ricardo Wurmus
Subject: Re: Getting started with GWL 0.3.0
Date: Fri, 26 Mar 2021 22:01:09 +0100
User-agent: mu4e 1.4.14; emacs 27.1

Hi Roel,

>> > Is there a feature-branch to try out GWL with Guile-DRMAA? :)
>> 
>> Unfortunately not yet.
>> 
>> I haven’t been 100% successful with the only DRMAA-enabled cluster that
>> I have access to, and it turns out that it’s not as simple as SGE’s
>> “hold_jid”.
>> 
>> It’s no longer “fire and forget”, which is a bit sad, but that’s how
>> DRMAA works.  We need a run-time component that keeps track of submitted
>> jobs and their status and actively starts held jobs when the
>> prerequisites have finished.
>
> That's unfortunate, but I believe having a daemon that keeps track of
> the workflow opens possibilities for "cloud" "orchestration".

Yes, it’s pretty much the same mechanism, except that for the “cloud” we
generally don’t have a ready-made “select” or “wait” equivalent.  There
we would either need to write code that lets the instances contact a
coordination service or let the GWL process poll their status.
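To illustrate the polling variant, here is a minimal sketch in plain Guile.  Everything in it is hypothetical: “instance-status”, the instance names, and the fake countdown table only stand in for whatever status API a cloud provider would offer.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical status source: each instance reports 'done after N polls.
(define remaining-polls (list (cons 'vm-a 1) (cons 'vm-b 3)))

(define (instance-status id)
  (let ((entry (assq id remaining-polls)))
    (set-cdr! entry (- (cdr entry) 1))
    (if (<= (cdr entry) 0) 'done 'running)))

(define (poll-until-done ids interval)
  ;; Poll every pending instance once per round and sleep INTERVAL
  ;; seconds between rounds.  Return the instances in the order in
  ;; which they were seen to finish.
  (let loop ((pending ids) (finished '()))
    (if (null? pending)
        (reverse finished)
        (call-with-values
            (lambda ()
              (partition (lambda (id) (eq? (instance-status id) 'done))
                         pending))
          (lambda (done running)
            (unless (null? running) (sleep interval))
            (loop running (append (reverse done) finished)))))))
```

Calling (poll-until-done '(vm-a vm-b) 0) runs three polling rounds against the fake table.  A real implementation would also need a timeout and error handling, which this leaves out.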

With DRMAA it’s pretty simple: we submit all jobs in hold state, then
start the first layer, and then we use the “wait” call to be notified of
any completed job.  The docstring in Guile DRMAA says:

--8<---------------cut here---------------start------------->8---
   "Wait for the completion of a job with identifier JOB-ID.  If the
JOB-ID is the special symbol '*, wait for the completion of any job that
has been submitted during this session.

TIMEOUT (an integer) specifies the number of seconds to block.  If it
is not provided or is #FALSE this procedure will block forever.

This procedure returns three values: the identifier of the job that
has completed, the status code of the job (an opaque value), and an
alist of resource usage statistics."
--8<---------------cut here---------------end--------------->8---

The GWL already knows the graph of processes and each process
corresponds to a submitted job, so with the return values of this
procedure it should really not be complicated to implement.
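A rough sketch of that loop, as a self-contained simulation rather than real DRMAA code.  The graph and the procedures “release!” and “wait-any!” are made up for illustration; “wait-any!” stands in for (wait '*), which according to the docstring above returns three values instead of just the job identifier.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical workflow: each entry is (process . prerequisites).
(define graph
  '((align . ()) (qc . ()) (call . (align)) (report . (qc call))))

;; Stand-in for the DRMAA session: a FIFO queue of released jobs that
;; "complete" in the order in which they were released.
(define released '())

(define (release! job)          ; stand-in for releasing a held job
  (set! released (append released (list job))))

(define (wait-any!)             ; stand-in for (wait '*)
  (let ((job (car released)))
    (set! released (cdr released))
    job))

(define (ready? job done started)
  ;; A job is ready when it has not been started yet and all of its
  ;; prerequisites have completed.
  (and (not (memq job started))
       (every (lambda (dep) (memq dep done))
              (cdr (assq job graph)))))

(define (run-workflow)
  ;; Release the first layer, then release further jobs as soon as
  ;; their prerequisites finish.  Return the completion order.
  (let loop ((done '()) (started '()))
    (let* ((ready (filter (lambda (job) (ready? job done started))
                          (map car graph)))
           (started (append started ready)))
      (for-each release! ready)
      (if (null? released)
          (reverse done)
          (loop (cons (wait-any!) done) started)))))
```

With this graph (run-workflow) returns (align qc call report).  The real monitor would additionally inspect the status code that “wait” returns and decide what to do with dependants of a failed job.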

>> It’s not clear to me if and how we should persist workflow state.  The
>> GWL will submit all jobs to the scheduler in a held state and then
>> change their status when it’s their turn.  I wonder if and how we should
>> handle the case where the GWL runtime monitor dies and is restarted.
>> The easiest way is to simply kill all queued up jobs, but I don’t know
>> if there’s a better approach.
>> 
>> Ideas?
>
> I find killing/removing queued jobs upon exiting the runtime monitor a
> good idea!

With DRMAA this is very easy.  The “control” procedure allows us to kill
all jobs that were enqueued in the current session.  In Guile DRMAA
that’s

   (control '* 'terminate)
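One way to make sure this runs whenever the monitor exits is “dynamic-wind”.  The sketch below is only an illustration: “control” is replaced by a stub that records its arguments so that the example runs without a cluster, and “call-with-job-cleanup” is a made-up name.

```scheme
;; Stand-in for Guile DRMAA's "control"; it only records the call.
(define calls '())

(define (control job-id action)
  (set! calls (cons (list job-id action) calls)))

(define (call-with-job-cleanup thunk)
  ;; Run THUNK and terminate all jobs of the session when control
  ;; leaves it, whether normally or through an exception.
  (dynamic-wind
    (lambda () #t)
    thunk
    (lambda () (control '* 'terminate))))

;; Simulate a monitor that fails half-way through.
(catch #t
  (lambda ()
    (call-with-job-cleanup
     (lambda () (error "monitor died"))))
  (lambda (key . args) #f))
```

After the simulated failure, “calls” contains the single entry (* terminate).  This only covers exits that unwind the stack; if the process is killed outright the cleanup does not run, and the queued jobs would stay behind.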

> I have access to a SLURM cluster (I don't know which version of DRMAA
> it supports), but I can test it.

SLURM has an external DRMAA 1.0 implementation; it is not included by
default.  In Guix that’s provided by the slurm-drmaa package.

-- 
Ricardo


