Re: [gwl-devel] support for containers


From: Ricardo Wurmus
Subject: Re: [gwl-devel] support for containers
Date: Wed, 30 Jan 2019 13:46:49 +0100
User-agent: mu4e 1.0; emacs 26.1

Hi Simon,

> On Wed, 30 Jan 2019 at 00:16, Ricardo Wurmus <address@hidden> wrote:
>
>> Since we don’t hash the data (because it’s expensive) the scripts are
>> “proxies” for the data files.  We compute the hashes over the dependent
>> scripts and assume that this is enough to decide whether to recompute
>> data files or to serve them from the cache/store.
>
> Just to be sure I understand your point, let's pick a simple
> example from a genomics pipeline:
>  FASTQ -align-> BAM -variant-> VCF
> So, do you intend to hash:
>  - the data FASTQ
>  - the scripts align and variant
> Or only the scripts containing references to inputs (here FASTQ), where
> the reference is a location fixed by the user?

Currently, there is no good way for a user to pass inputs to a workflow,
so I haven’t yet thought about how to handle the user’s input files.
This still needs to be done.  For now, the only way a user can provide
files as inputs is by writing a process that “generates” the file (even
if it does so by merely accessing the impure file system).  That’s
rather inconvenient, and it wouldn’t work in a container where only
declared files are available.

Users should be able to map files to any process input from the command
line (or through a configuration file).  For each provided input we
should then take into account a hash of some property of the file:
either the timestamp and the name (cheap), or the contents (expensive).
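
To make the cheap/expensive trade-off concrete, here is a rough Python
sketch (not GWL code; the function name `input_fingerprint` and the
`use_contents` switch are made up for illustration, and using the
absolute path as the "name" is an assumption):

```python
import hashlib
import os


def input_fingerprint(path, use_contents=False):
    """Return a hex digest that stands in for the input file at PATH.

    Hypothetical helper.  In cheap mode only the file name and its
    modification time are hashed; in expensive mode the whole file
    is read and hashed.
    """
    digest = hashlib.sha256()
    if use_contents:
        # Expensive: read the file in 1 MiB chunks and hash every byte.
        with open(path, "rb") as port:
            for chunk in iter(lambda: port.read(1 << 20), b""):
                digest.update(chunk)
    else:
        # Cheap: hash the absolute file name and the mtime only.
        stat = os.stat(path)
        digest.update(os.path.abspath(path).encode())
        digest.update(b"\0")
        digest.update(str(stat.st_mtime_ns).encode())
    return digest.hexdigest()
```

The cheap variant changes whenever the file is touched or renamed, even
if the bytes are identical; the expensive variant changes only when the
contents do.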

As regards hashing the scripts, here’s what I have so far:

--8<---------------cut here---------------start------------->8---
(define (workflow->data-hashes workflow engine)
  "Return an alist associating each of the WORKFLOW's processes with
the hash of all the process scripts used to generate their outputs."
  (define make-script (process->script engine))
  (define graph (workflow-restrictions workflow))

  ;; Compute hashes for chains of scripts.
  (define (kons process acc)
    (let* ((script (make-script process #:workflow workflow))
           (hash   (bytevector->u8-list
                    (sha256 (call-with-input-file script get-bytevector-all)))))
      (cons
       (cons process
             (append hash
                     ;; Hashes of processes this one depends on.
                     (append-map (cut assoc-ref acc <>)
                                 (or (assoc-ref graph process) '()))))
       acc)))
  (map (match-lambda
         ((process . hashes)
          (cons process
                (bytevector->base32-string
                 (sha256
                  (u8-list->bytevector hashes))))))
       (fold kons '()
             (workflow-run-order workflow #:parallel? #f))))
--8<---------------cut here---------------end--------------->8---

I.e. for any process we want the hash over the script used for the
current process and for all processes that lead up to the current one.
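
For readers who don't speak Scheme, the same chaining idea can be
sketched in Python.  This is only an illustration under assumptions: the
function name `data_hashes` and its three arguments are invented here,
scripts are passed in as bytes rather than generated by an engine, and
the result is hex-encoded where the code above uses Guix's base32
encoding.

```python
import hashlib


def data_hashes(run_order, depends_on, script_bytes):
    """Map each process to a hash over its script and all upstream scripts.

    RUN_ORDER lists process names with dependencies first, DEPENDS_ON
    maps a process to the processes it directly depends on, and
    SCRIPT_BYTES maps a process to the bytes of its script.  All three
    are hypothetical stand-ins for what the workflow provides.
    """
    chains = {}
    for process in run_order:
        # Start with the hash of this process's own script...
        chain = hashlib.sha256(script_bytes[process]).digest()
        # ...and append the accumulated chains of its dependencies.
        for dependency in depends_on.get(process, []):
            chain += chains[dependency]
        chains[process] = chain
    # Collapse each chain into a single printable hash string.
    return {process: hashlib.sha256(chain).hexdigest()
            for process, chain in chains.items()}
```

With the FASTQ -align-> BAM -variant-> VCF example, one would call
`data_hashes(["align", "variant"], {"variant": ["align"]}, scripts)`.
Editing the align script changes the hashes of both processes, while
editing only the variant script leaves the hash for align untouched.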

This gives us a hash string for every process.  We can then look up
“${GWL_STORE}/${hash}/output-file-name”; if it exists, we use it.  The
workflow runner would now also need to ensure that process outputs are
linked to the appropriate GWL_STORE location upon successful execution.
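
The lookup and linking steps might look roughly like this in Python.
None of this exists in the GWL; `cached_output` and `link_output` are
hypothetical names, and using a hard link with a copy as fallback is
just one possible reading of "linked".

```python
import os
import shutil


def cached_output(store, process_hash, name):
    """Return the cached file STORE/PROCESS_HASH/NAME, or None if absent."""
    path = os.path.join(store, process_hash, name)
    return path if os.path.exists(path) else None


def link_output(store, process_hash, output_path):
    """Make OUTPUT_PATH available under STORE/PROCESS_HASH/ and return it.

    Assumption: a hard link is preferred; if that fails (for example
    across file systems), the file is copied instead.
    """
    target_dir = os.path.join(store, process_hash)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(output_path))
    if not os.path.exists(target):
        try:
            os.link(output_path, target)
        except OSError:
            shutil.copy2(output_path, target)
    return target
```

A runner would call `cached_output` before executing a process and skip
the process on a hit; after a successful run it would call `link_output`
for each declared output.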

--
Ricardo



