gwl-devel

Re: Processing large amounts of files


From: Ricardo Wurmus
Subject: Re: Processing large amounts of files
Date: Wed, 27 Mar 2024 10:58:10 +0100
User-agent: mu4e 1.10.8; emacs 29.1

Liliana Marie Prikler <liliana.prikler@ist.tugraz.at> writes:

> Am Dienstag, dem 26.03.2024 um 22:30 +0100 schrieb Ricardo Wurmus:
>> 
>> Ricardo Wurmus <rekado@elephly.net> writes:
>> > Another significant delay is introduced by the cache mechanism,
>> > which computes a unique prefix based on the contents of all input
>> > files.  It's not unexpected that this will take a little while, but
>> > it's not great either.
>> 
>> With commit f4442e409cf05d0c7cc4d6a251626d22efaffe8c it's a little
>> faster.  We used a whole lot of alists, and this becomes slow when
>> there are thousands of inputs.  We're now using hash tables.
> SGTM.  I assume the caches are internal and do not affect input order
> otherwise?  i.e. a process that declares
>
>   inputs : files "foo" "bar" "baz"
>
> will still see the same {{inputs}} as before?

Yes, the order should always be the same.
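To make that concrete, here is a minimal sketch in Python rather than GWL's actual Scheme code; the function names are hypothetical.  It assumes the hash table serves only as a lookup index next to the declared list, so that the list itself, and with it the order a process sees, is never rebuilt from the table.

```python
def alist_lookup(alist, key):
    """Association-list lookup: scans the pairs one by one, O(n) per query."""
    for k, v in alist:
        if k == key:
            return v
    return None


def index_inputs(declared):
    """Build a hash-table index (name -> position) for O(1) lookups.

    The declared list is only read, never reordered.
    """
    return {name: position for position, name in enumerate(declared)}


# Hypothetical declaration matching the example quoted above.
declared = ["foo", "bar", "baz"]
index = index_inputs(declared)
```

With thousands of inputs, repeating a linear scan for every input is quadratic overall, whereas the indexed lookup stays roughly linear.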

> I see there are tests covering make-process, but I'm not quite sure how
> to parse "prepare-inputs returns the unmodified inputs-map when all
> files exist" tbh.

Input handling is a big bag of compromises.  In the distant past
workflows hardcoded input file names, which were assumed to be present
at runtime.  That wasn't great for my use case, which was to specify a
workflow as a generic thing that has deterministic behavior but allows
for plugging in different input files.

That's why I decoupled process scripts from their inputs; inputs are
passed as arguments to these unchanging scripts.

GWL currently assumes that *any* input anywhere in the workflow can be
injected by the user.  There is an option to provide an input mapping,
which maps an existing file to an input file name in the workflow.

GWL will first compute free inputs, i.e. inputs that are not provided by
any of the outputs of any process in the workflow.  GWL expects that
these free inputs are either declared by the user or --- and this is a
pragmatic decision that I'm not too happy with --- that a file matching
the input name can be found relative to the current directory.

The above test is for the simple case where no files were discovered
to fill the slots of computed free inputs.
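
The free-input logic described above can be sketched roughly as follows.  This is Python for illustration only, not GWL's implementation: the dictionary-based process representation, the function names, and the file names in the usage note are all assumptions.

```python
import os


def free_inputs(processes):
    """Return inputs that no process in the workflow produces, in first-seen order."""
    produced = {out for p in processes for out in p["outputs"]}
    seen = set()
    free = []
    for p in processes:
        for name in p["inputs"]:
            if name not in produced and name not in seen:
                seen.add(name)
                free.append(name)
    return free


def resolve_free_inputs(free, input_map, base_dir="."):
    """Fill each free input slot.

    A user-supplied mapping wins; otherwise fall back to a file with the
    same name relative to base_dir.  Anything left over is reported as missing.
    (This sketch does not check that mapped files actually exist.)
    """
    resolved = {}
    missing = []
    for name in free:
        candidate = os.path.join(base_dir, name)
        if name in input_map:
            resolved[name] = input_map[name]
        elif os.path.exists(candidate):
            resolved[name] = candidate
        else:
            missing.append(name)
    return resolved, missing
```

For a hypothetical two-step workflow where the first process reads "reads.fq" and "genome.fa" and writes "aligned.bam", and the second reads "aligned.bam", only "reads.fq" and "genome.fa" are free; "aligned.bam" is provided by the workflow itself.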


The caching mechanism exists to avoid rerunning processes when their
output files already exist.  In the presence of input maps and file
discovery relative to the current working directory, however, it is
necessary to rerun processes when the input files differ.

GWL computes hashes of the mapped input files and of all process scripts
to arrive at a cache prefix.  This cache prefix is derived from a chain
of hashes that covers the workflow definitions and the effective inputs.
Given the same input files and the same workflow we can avoid running
the whole workflow again when the cache already contains outputs from a
previous run.
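
As a rough illustration of the idea, and not the code GWL actually uses, a prefix of this kind could be derived as below.  The choice of SHA-256, the function names, and the way the hashes are chained are assumptions made for this sketch.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so large inputs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_prefix(script_texts, input_paths):
    """Combine process scripts and effective inputs into one prefix.

    script_texts: list of process script sources, in workflow order.
    input_paths:  dict mapping workflow input names to the files filling them.
    """
    h = hashlib.sha256()
    for script in script_texts:
        h.update(hashlib.sha256(script.encode("utf-8")).digest())
    # Sort by input name so the prefix does not depend on dict ordering.
    for name in sorted(input_paths):
        h.update(name.encode("utf-8"))
        h.update(file_digest(input_paths[name]).encode("ascii"))
    return h.hexdigest()
```

Changing either a script or the contents of a mapped input file yields a different prefix, so outputs cached under the old prefix are not reused.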

-- 
Ricardo


