gwl-devel

Re: Managing data files in workflows


From: Ricardo Wurmus
Subject: Re: Managing data files in workflows
Date: Fri, 26 Mar 2021 09:47:11 +0100
User-agent: mu4e 1.4.14; emacs 27.1

Hi Konrad,

> Coming from make-like workflow systems, I wonder how data files are best
> managed in GWL workflow. GWL is clearly less file-centric than make
> (which is a Good Thing in my opinion), but at a first reading of the
> manual, it doesn't seem to care about files at all, except for
> auto-connect.
>
> A simple example:
>
> ==================================================
> process download
>   packages "wget"
>   outputs
>     file "data/weekly-incidence.csv"
> # { wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv }
>
> workflow influenza-incidence
>   processes download
> ==================================================

This works correctly for me:

--8<---------------cut here---------------start------------->8---
$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
The following derivations will be built:
   /gnu/store/59isvjs850hm6ipywhaz34zvn0235j2g-gwl-download.scm.drv
   /gnu/store/s8yx15w5zwpz500brl6mv2qf2s9id309-profile.drv

building path(s) `/gnu/store/izhflk47bpimvj3xk3r4ddzaipj87cny-ca-certificate-bundle'
building path(s) `/gnu/store/i7prqy908kfsxsvzksr06gxks2jd3s08-fonts-dir'
building path(s) `/gnu/store/pzcqa593l8msd4m3s0i0a3bx84llzlpa-info-dir'
building path(s) `/gnu/store/7f5i86dw32ikm9czq1v17spnjn61j8z6-manual-database'
Creating manual page database...
[  2/  3] building list of man-db entries...
108 entries processed in 0.1 s
building path(s) `/gnu/store/mrv97q0d2732bk3hmj91znzigxyv1vsh-profile'
building path(s) `/gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm'
run: Executing: /bin/sh -c /gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm '((inputs) (outputs "./data/weekly-incidence.csv") (values) (name . "download"))'
--2021-03-26 09:41:17--  http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
Resolving www.sentiweb.fr (www.sentiweb.fr)... 134.157.220.17
Connecting to www.sentiweb.fr (www.sentiweb.fr)|134.157.220.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘./data/weekly-incidence.csv’

./data/weekly-incidence.csv          [ <=>                 ]  83.50K  --.-KB/s    in 0.05s

2021-03-26 09:41:18 (1.63 MB/s) - ‘./data/weekly-incidence.csv’ saved [85509]

$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
run: Skipping process "download" (cached at /tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/).
$
--8<---------------cut here---------------end--------------->8---

Here’s the changed workflow; the only difference is that it pulls in coreutils and creates the output directory with mkdir -p before running wget, since wget does not create missing directories:

--8<---------------cut here---------------start------------->8---
process download
  packages "wget" "coreutils"
  outputs
    file "data/weekly-incidence.csv"
  # {
    mkdir -p $(dirname {{outputs}})
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }

workflow influenza-incidence
  processes download
--8<---------------cut here---------------end--------------->8---

On the second run the process is skipped because the output file already
exists; the daring assumption we make is that outputs are reproducible.

I would like to make these assumptions explicit in a future version, but
I’m not sure how.  One idea is to add keyword arguments to “file” that
allow us to provide a content hash, or simply a flag to declare a file
as volatile and thus in need of recomputation.
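
Purely as a sketch of what that could look like (hypothetical syntax;
the keyword names are made up and nothing like this is implemented yet):

--8<---------------cut here---------------start------------->8---
process download
  packages "wget" "coreutils"
  outputs
    ; hypothetical keywords: a content hash to check the cached output
    ; against, or a volatile flag to always recompute it
    file "data/weekly-incidence.csv" #:hash "sha256:…" #:volatile? #false
  # {
    mkdir -p $(dirname {{outputs}})
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }
--8<---------------cut here---------------end--------------->8---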

I would also like to have IPFS and git-annex support, but before I embark
on this I want to understand exactly how it should behave and what the UI
should be.  For example, an input declared as “IPFS-file” would be
fetched automatically without having to specify a process that downloads
it first.  (Something similar could be implemented for web resources as
in your example.)
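
Something along these lines, say (again entirely hypothetical; neither
“IPFS-file” nor “web-file” exists, and the hash and script names are
placeholders):

--8<---------------cut here---------------start------------->8---
process plot-incidence
  packages "r-minimal"
  inputs
    ; hypothetical: GWL would fetch the object itself, so no separate
    ; download process is needed
    IPFS-file "/ipfs/QmExampleHashOfTheIncidenceData"
    ; or, for plain web resources as in your example:
    ; web-file "http://www.sentiweb.fr/datasets/incidence-PAY-3.csv"
  outputs
    file "incidence.png"
  # { Rscript plot.R {{inputs}} {{outputs}} }
--8<---------------cut here---------------end--------------->8---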

-- 
Ricardo


