Re: Managing data files in workflows
From: Ricardo Wurmus
Subject: Re: Managing data files in workflows
Date: Fri, 26 Mar 2021 09:47:11 +0100
User-agent: mu4e 1.4.14; emacs 27.1
Hi Konrad,
> Coming from make-like workflow systems, I wonder how data files are best
> managed in GWL workflow. GWL is clearly less file-centric than make
> (which is a Good Thing in my opinion), but at a first reading of the
> manual, it doesn't seem to care about files at all, except for
> auto-connect.
>
> A simple example:
>
> ==================================================
> process download
>   packages "wget"
>   outputs
>     file "data/weekly-incidence.csv"
>   # { wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
>   }
>
> workflow influenza-incidence
>   processes download
> ==================================================
This works correctly for me:
--8<---------------cut here---------------start------------->8---
$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
The following derivations will be built:
/gnu/store/59isvjs850hm6ipywhaz34zvn0235j2g-gwl-download.scm.drv
/gnu/store/s8yx15w5zwpz500brl6mv2qf2s9id309-profile.drv
building path(s) `/gnu/store/izhflk47bpimvj3xk3r4ddzaipj87cny-ca-certificate-bundle'
building path(s) `/gnu/store/i7prqy908kfsxsvzksr06gxks2jd3s08-fonts-dir'
building path(s) `/gnu/store/pzcqa593l8msd4m3s0i0a3bx84llzlpa-info-dir'
building path(s) `/gnu/store/7f5i86dw32ikm9czq1v17spnjn61j8z6-manual-database'
Creating manual page database...
[ 2/ 3] building list of man-db entries...
108 entries processed in 0.1 s
building path(s) `/gnu/store/mrv97q0d2732bk3hmj91znzigxyv1vsh-profile'
building path(s) `/gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm'
run: Executing: /bin/sh -c
/gnu/store/chz5lck01vd3wlx3jb35d3qchwi3908f-gwl-download.scm '((inputs)
(outputs "./data/weekly-incidence.csv") (values) (name . "download"))'
--2021-03-26 09:41:17-- http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
Resolving www.sentiweb.fr (www.sentiweb.fr)... 134.157.220.17
Connecting to www.sentiweb.fr (www.sentiweb.fr)|134.157.220.17|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘./data/weekly-incidence.csv’
./data/weekly-incidence.csv    [ <=>                ]  83.50K  --.-KB/s    in 0.05s
2021-03-26 09:41:18 (1.63 MB/s) - ‘./data/weekly-incidence.csv’ saved [85509]
$ guix workflow run foo.w
info: Loading workflow file `foo.w'...
info: Computing workflow `influenza-incidence'...
run: Skipping process "download" (cached at
/tmp/gwl/lf6uca7zcyyldkcrxn3zwc275ax3ip676aqgjo75ybwojtl4emoq/).
$
--8<---------------cut here---------------end--------------->8---
Here’s the changed workflow:
--8<---------------cut here---------------start------------->8---
process download
  packages "wget" "coreutils"
  outputs
    file "data/weekly-incidence.csv"
  # {
    mkdir -p $(dirname {{outputs}})
    wget -O {{outputs}} http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
  }

workflow influenza-incidence
  processes download
--8<---------------cut here---------------end--------------->8---
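The mkdir line matters because wget’s “-O” does not create missing parent directories; the process has to create them first. A minimal sketch of that step outside GWL, where “$out” stands in for the {{outputs}} placeholder that GWL substitutes:

```shell
# "$out" is a stand-in for GWL's {{outputs}} placeholder.
out="data/weekly-incidence.csv"
mkdir -p "$(dirname "$out")"   # creates ./data; a no-op if it already exists
test -d "$(dirname "$out")" && echo "directory ready: $(dirname "$out")"
# prints: directory ready: data
```

“dirname” comes from coreutils, which is why the changed process adds that package.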
It skips the process because the output file already exists; the daring
assumption we make is that outputs are reproducible, so an existing
output never needs to be recomputed.
I would like to make these assumptions explicit in a future version, but
I’m not sure how. One idea is to add keyword arguments to “file” that
allow us to provide a content hash, or merely a flag to declare a file
as volatile and thus always in need of recomputation.
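The content-hash variant could behave roughly like this sketch (assumed semantics, not anything GWL implements today — the recorded hash and the “check” helper are hypothetical): a process is skipped only while the file on disk still matches the hash recorded in the workflow.

```shell
# Hypothetical cache check: skip while the on-disk file matches the
# hash recorded at declaration time, recompute once it drifts.
file=$(mktemp)
printf 'week,cases\n' > "$file"
recorded=$(sha256sum "$file" | cut -d' ' -f1)   # hash stored in the workflow

check () {
    current=$(sha256sum "$file" | cut -d' ' -f1)
    if [ "$current" = "$recorded" ]; then
        echo "cache hit: skipping process"
    else
        echo "hash mismatch: recomputing"
    fi
}

check                            # prints: cache hit: skipping process
printf 'extra row\n' >> "$file"  # the output was modified out of band
check                            # prints: hash mismatch: recomputing
rm -f "$file"
```

A volatile flag would be the degenerate case: treat every check as a mismatch.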
I also wanted to have IPFS and git-annex support, but before I embark on
this I want to understand exactly how this should behave and what the UI
should be. E.g. having an input that is declared as “IPFS-file” would
cause that input file to be fetched automatically without having to
specify a process that downloads it first. (Something similar could be
implemented for web resources as in your example.)
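In spirit, such declarations would dispatch to a fetcher before any process runs. A rough sketch of that dispatch (the declaration kinds, the helper, and the IPFS hash are all hypothetical — it only echoes what it would do, so no network access happens):

```shell
# Hypothetical dispatcher: pick a fetch command from the kind of the
# declared input; plain files fall through to the local case.
fetch_input () {
    kind=$1 ref=$2
    case $kind in
        web-file)  echo "would run: wget -O \$target $ref" ;;
        ipfs-file) echo "would run: ipfs get $ref" ;;
        *)         echo "local input: $ref" ;;
    esac
}

fetch_input web-file http://www.sentiweb.fr/datasets/incidence-PAY-3.csv
fetch_input ipfs-file QmExampleHash
```

The open UI question is exactly where such a declaration lives — on the “file” form of a process input, or as a standalone top-level declaration.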
--
Ricardo