
[Myexperiment-discuss] Distributed Workflow


From: David Brown
Subject: [Myexperiment-discuss] Distributed Workflow
Date: Mon, 25 Jul 2011 14:32:15 -0700

I've been researching workflow engines and how well they might apply
to a distributed set of resources.

I've looked at a variety of workflow engines such as Kepler and
Taverna, but I can't seem to find a reasonable solution without
writing code.

I described my issues to Mark Borkum, and he suggested that I send a
message to this mailing list.

The background of the issue:

EMSL (the Environmental Molecular Sciences Laboratory), located at
PNNL, has many instruments, from microscopes to NMR to proteomics
processing. There are over 140 individual instruments in all. We also
maintain and manage a large compute resource: a 2,300-compute-node
supercomputer that is about 3 years old, and we're working toward
getting a new one. We've also got several other small (16-ish compute
node) clusters that are dedicated to some of the instruments for
initial processing of the raw data.

What I mean when I think of scientific workflow:

When I talk to scientists here about scientific workflow, there's a
big difference between how they talk about it and how the workflow
engines available today deal with it. They talk about getting data
(files) out of the instruments, transferring those files to compute
resources, and processing them in an automated fashion. These steps
would also span more than one compute resource, along with gathering
and storing metadata along the way that would be useful for their
project.
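
To make that concrete, here is a minimal sketch in Python of the kind
of instrument-to-compute staging step I have in mind. The directory
layout, file extension, and metadata fields are purely hypothetical,
not any existing EMSL convention:

    import shutil
    from pathlib import Path

    def stage_instrument_run(instrument_dir: Path, staging_dir: Path) -> dict:
        """Copy raw files from an instrument's drop-off area to a compute
        resource's staging area and collect simple provenance metadata."""
        metadata = {}
        staging_dir.mkdir(parents=True, exist_ok=True)
        for raw_file in sorted(instrument_dir.glob("*.raw")):
            target = staging_dir / raw_file.name
            shutil.copy2(raw_file, target)               # transfer step
            metadata[target.name] = {
                "source": str(raw_file),                 # where it came from
                "size_bytes": target.stat().st_size,     # basic provenance
            }
        # Processing would be launched on the compute resource here, e.g.
        # by submitting a batch job (see the submit/poll sketch further down).
        return metadata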

There are many challenges with a project like this and many different
technologies that could be used to make it work. However, my focus is
on workflow and trying to make these disparate resources work together
in an automated or semi-automated fashion. Now, I do like the idea of
SOAP/REST services for these systems and WSDL-style self-description
of how to use each resource, including some of the instruments and
their controlling systems. However, there are limitations to choosing
an engine based on a DAG for these systems.

The crux of the issue is that the client needs to wait longer than the
timeout of a single TCP connection between two systems. Interfacing
with a supercomputer means that your job may not start for several
days, and after that it may take a couple of days to complete. That
sort of interface is one I haven't seen done quite right yet. The
difference is that in one node of the DAG you submit a request and get
back an ID for that request. The client can then query for the state
of that ID, and at some point later the request is acted on and the
response carries the output data. This interface could also be adapted
to deal with instruments that are operated by technicians.
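
To illustrate the submit/poll pattern I mean, here is a rough sketch
in Python. The endpoint URLs and JSON fields are assumptions made up
for the example, not an existing Taverna, Kepler, or EMSL interface:

    import json
    import time
    import urllib.request

    BASE = "http://compute.example.org/api"   # hypothetical job service

    def _get_json(url, payload=None):
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(url, data=data,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def submit_and_wait(job_spec, poll_seconds=300):
        """Submit a job, get back its ID, then poll until it finishes.

        No single TCP connection has to stay open while the job sits in
        a queue for days; the client just checks back periodically."""
        job = _get_json(BASE + "/jobs", payload=job_spec)    # submit, returns an ID
        job_id = job["id"]
        while True:
            state = _get_json("%s/jobs/%s" % (BASE, job_id)) # query state by ID
            if state["status"] in ("finished", "failed"):
                return state                                 # includes output data
            time.sleep(poll_seconds)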

What I'm wondering is whether Taverna's engine could be adapted to
support this kind of interface within the DAG. Does it make sense to
adapt the engine to meet these requirements, or would simply writing a
general plugin for Taverna be a better solution? The same questions
apply to Kepler, if you've used or developed with that system.

Thanks,
- David Brown


