bug#42162: Recovering source tarballs


From: Ludovic Courtès
Subject: bug#42162: Recovering source tarballs
Date: Fri, 31 Jul 2020 16:41:59 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)

Hi Timothy!

Timothy Sample <samplet@ngyro.com> writes:

> This jumped out at me because I have been working with compression and
> tarballs for the bootstrapping effort.  I started pulling some threads
> and doing some research, and ended up prototyping an end-to-end solution
> for decomposing a Gzip’d tarball into Gzip metadata, tarball metadata,
> and an SWH directory ID.  It can even put them back together!  :)  There
> are a bunch of problems still, but I think this project is doable in the
> short-term.  I’ve tested 100 arbitrary Gzip’d tarballs from Guix, and
> found and fixed a bunch of little gaffes.  There’s a ton of work to do,
> of course, but here’s another small step.
>
> I call the thing “Disarchive” as in “disassemble a source code archive”.
> You can find it at <https://git.ngyro.com/disarchive/>.  It has a simple
> command-line interface so you can do
>
>     $ disarchive save software-1.0.tar.gz
>
> which serializes a disassembled version of “software-1.0.tar.gz” to the
> database (which is just a directory) specified by the “DISARCHIVE_DB”
> environment variable.  Next, you can run
>
>     $ disarchive load hash-of-something-in-the-db
>
> which will recover an original file from its metadata (stored in the
> database) and data retrieved from the SWH archive or taken from a cache
> (again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Wooohoo!  Is it that time of the year when people give presents to one
another?  I can’t believe it.  :-)

> Now some implementation details.  The way I’ve set it up is that all of
> the assembly happens through Guix.  Each step in recreating a compressed
> tarball is a fixed-output derivation: the download from SWH, the
> creation of the tarball, and the compression.  I wanted an easy way to
> build and verify things according to a dependency graph without writing
> any code.  Hi Guix Daemon!  I’m not sure if this is a good long-term
> approach, though.  It could work well for reproducibility, but it might
> be easier to let some external service drive my code as a Guix package.
> Either way, it was an easy way to get started.
>
> For disassembly, it takes a Gzip file (containing a single member) and
> breaks it down like this:
>
>     (gzip-member
>       (version 0)
>       (name "hungrycat-0.4.1.tar.gz")
>       (input (sha256
>                "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
>       (header
>         (mtime 0)
>         (extra-flags 2)
>         (os 3))
>       (footer
>         (crc 3863610951)
>         (isize 194560))
>       (compressor gnu-best)
>       (digest
>         (sha256
>           "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

Awesome.
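
(And to make the fixed-output trick concrete for onlookers: each of the
steps you describe would boil down to something like the sketch below.
This is hypothetical code, not what Disarchive actually does; in
particular I’m pretending that ‘disarchive load’ writes to stdout and
that ‘/bin/sh’ is visible in the build environment:)

    (use-modules (guix store)
                 (guix derivations)
                 (guix base32))

    (with-store store
      ;; Fixed-output derivation: the daemon runs the builder and
      ;; checks that the output matches the declared SHA256, here the
      ;; ‘digest’ of the gzip-member sexp above.
      (derivation store "hungrycat-0.4.1.tar.gz"
                  "/bin/sh"
                  '("-c" "exec disarchive load 03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh > $out")
                  #:hash (base32
                          "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")
                  #:hash-algo 'sha256))

The nice thing is that the daemon gives you caching and verification
for free.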

> The header and footer are read directly from the file.  Finding the
> compressor is harder.  I followed the approach taken by the pristine-tar
> project.  That is, try a bunch of compressors and hope for a match.
> Currently, I have:
>
>     • gnu-best
>     • gnu-best-rsync
>     • gnu
>     • gnu-rsync
>     • gnu-fast
>     • gnu-fast-rsync
>     • zlib-best
>     • zlib
>     • zlib-fast
>     • zlib-best-perl
>     • zlib-perl
>     • zlib-fast-perl
>     • gnu-best-rsync-1.4
>     • gnu-rsync-1.4
>     • gnu-fast-rsync-1.4

I would have used the integers that zlib supports, but I guess that
doesn’t capture this whole gamut of compression setups.  And yeah, it’s
not great that we actually have to search for the right compression
parameters, but there seems to be no way around it, and as you write,
we can expect a couple of variants to cover the most common cases.
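
For the record, here’s roughly how I picture the trial-and-error: a toy
sketch that shells out to ‘gzip’ and only knows the three plain GNU
levels, unlike your fifteen variants:

    (use-modules (ice-9 match)
                 (ice-9 popen)
                 (ice-9 binary-ports)
                 (srfi srfi-1))

    (define (gzip-bytes file flags)
      ;; Compress FILE with the given gzip FLAGS; return the bytes.
      (let* ((port (apply open-pipe* OPEN_READ
                          "gzip" "-c" (append flags (list file))))
             (data (get-bytevector-all port)))
        (close-pipe port)
        data))

    (define (guess-compressor tarball original)
      ;; Return the name of the first candidate that reproduces
      ;; ORIGINAL byte-for-byte, or #f if none does.  “-n” clears the
      ;; gzip timestamp; the real header fields live in the metadata.
      (let ((want (call-with-input-file original
                    get-bytevector-all #:binary #t)))
        (any (match-lambda
               ((name . flags)
                (and (equal? want (gzip-bytes tarball flags)) name)))
             '((gnu-best "-9" "-n")
               (gnu      "-6" "-n")
               (gnu-fast "-1" "-n")))))

I suppose the rsyncable and zlib variants just add more rows to that
table.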

> The “input” field likely points to a tarball, which looks like this:
>
>     (tarball
>       (version 0)
>       (name "hungrycat-0.4.1.tar")
>       (input (sha256
>                "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
>       (default-header)
>       (headers
>         ((name "hungrycat-0.4.1/")
>          (mode 493)
>          (mtime 1513360022)
>          (chksum 5058)
>          (typeflag 53))
>         ((name "hungrycat-0.4.1/configure")
>          (mode 493)
>          (size 130263)
>          (mtime 1513360022)
>          (chksum 6043))
>         ...)
>       (padding 3584)
>       (digest
>         (sha256
>           "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))
>
> Originally, I used your code, but I ran into some problems.  Namely,
> real tarballs are not well-behaved.  I wrote new code to keep track of
> subtle things like the formatting of the octal values.

Yeah, I guess I was too optimistic.  :-)  I wanted the
serialization/deserialization code to be automatically generated by
that macro, but it doesn’t capture enough details for real-world
tarballs.

Do you know how frequently you get “weird” tarballs?  I was thinking
about having something that works for plain GNU tar, but it’s even
better to have something that works with “unusual” tarballs!

(BTW, the code I posted or the one in Disarchive could perhaps replace
the one in Gash-Utils.  I was frustrated not to find a ‘fold-archive’
procedure there, notably.)
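
What I have in mind is something along these lines: a toy
‘fold-archive’ that assumes well-behaved ustar input and none of the
quirks you mention:

    (use-modules (rnrs bytevectors)
                 (ice-9 binary-ports))

    (define (field->string header offset len)
      ;; Extract a tar header field, dropping NUL/space padding.
      (let ((bytes (make-bytevector len)))
        (bytevector-copy! header offset bytes 0 len)
        (string-trim-both (utf8->string bytes)
                          (lambda (c) (memv c '(#\nul #\space))))))

    (define (fold-archive proc seed port)
      ;; Call (PROC header size seed) for each member of the tar
      ;; archive read from PORT, skipping the data blocks.
      (let loop ((result seed))
        (let ((header (get-bytevector-n port 512)))
          (if (or (eof-object? header)
                  (zero? (bytevector-u8-ref header 0))) ;zero block: end
              result
              (let* ((size   (string->number
                              (field->string header 124 12) 8))
                     (blocks (quotient (+ size 511) 512)))
                (unless (zero? blocks)
                  (get-bytevector-n port (* 512 blocks))) ;skip data
                (loop (proc header size result)))))))

For instance, listing member names would look like:

    (call-with-input-file "hungrycat-0.4.1.tar"
      (lambda (port)
        (fold-archive (lambda (header size names)
                        (cons (field->string header 0 100) names))
                      '()
                      port))
      #:binary #t)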

> Even though they are not well-behaved, they are usually
> self-consistent, so I introduced the “default-header” field to set
> default values for all headers.  Any omitted fields in the headers use
> the value from the default header, and the default header takes
> defaults from a “default default header” defined in the code.  Here’s
> a default header from a different tarball:
>
>     (default-header
>       (uid 1199)
>       (gid 30)
>       (magic "ustar ")
>       (version " \x00")
>       (uname "cagordon")
>       (gname "lhea")
>       (devmajor-format (width 0))
>       (devminor-format (width 0)))

Very nice.
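
(If I read you correctly, field lookup cascades like this; hypothetical
code, with the header sexps read as alists:)

    (define default-default-header
      ;; Made-up built-in defaults.
      '((uid 0) (gid 0) (magic "ustar") (version "00")))

    (define (header-ref header default-header field)
      ;; Use HEADER’s value for FIELD, falling back to DEFAULT-HEADER
      ;; and then to the “default default header”.
      (cond ((assq field header)                 => cadr)
            ((assq field default-header)         => cadr)
            ((assq field default-default-header) => cadr)
            (else (error "unknown field" field))))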

> Finally, the “input” field here points to an “swh-directory” object.  It
> looks like this:
>
>     (swh-directory
>       (version 0)
>       (name "hungrycat-0.4.1")
>       (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
>       (digest
>         (sha256
>           "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

Yay!

> I have a little module for computing the directory hash like SWH does
> (which is in turn like what Git does).  I did not verify that the 100
> packages were in the SWH archive.  I did verify a couple of packages,
> but I hit the rate limit and decided to avoid it for now.
>
> To avoid hitting the SWH archive at all, I introduced a directory cache
> so that I can store the directories locally.  If the directory cache is
> available, directories are stored and retrieved from it.

I guess we can get back to them eventually to estimate our coverage ratio.
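
(Side note for readers: the ‘id’ above is computed like a Git tree
hash, so one can double-check a local copy without touching the archive
at all.  Modulo corner cases, e.g., empty directories, or ‘.gitignore’
files influencing ‘git add’, this should print the id from the sexp
above:)

    $ cd hungrycat-0.4.1
    $ git init -q && git add -A && git write-tree
    0496abd5a2e9e05c9fe20ae7684f48130ef6124a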

>> I think we’d have to maintain a database that maps tarball hashes to
>> metadata (!).  A simple version of it could be a Git repo where, say,
>> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
>> contain the metadata above.  The nice thing is that the Git repo itself
>> could be archived by SWH.  :-)
>
> You mean like <https://git.ngyro.com/disarchive-db/>?  :)

Woow.  :-)

We could actually have a CI job to create the database: it would
basically do ‘disarchive save’ for each tarball and store that using a
layout like the one you used.  Then we could have a job somewhere that
periodically fetches that and adds it to the database.  WDYT?
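
Concretely, I picture the job as little more than this, hand-waving
over details (‘hello’ standing in for each package in turn):

    $ export DISARCHIVE_DB=$PWD/disarchive-db
    $ for source in $(guix build --sources=transitive hello)
      do
          case "$source" in
              *.tar.gz) disarchive save "$source";;
          esac
      done
    $ git -C "$DISARCHIVE_DB" add -A
    $ git -C "$DISARCHIVE_DB" commit -m "Add sources for hello."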

I think we should leave room for other hash algorithms (in the sexps
above too).

> This was generated by a little script built on top of “fold-packages”.
> It downloads Gzip’d tarballs used by Guix packages and passes them on to
> Disarchive for disassembly.  I limited the number to 100 because it’s
> slow and because I’m sure there is a long tail of weird software
> archives that are going to be hard to process.  The metadata directory
> ended up being 13M and the directory cache 2G.

Neat.

So it does mean that we could pretty much right away add a fall-back in
(guix download) that looks up tarballs in your database and uses
Disarchive to reconstruct them, right?  I love solved problems.  :-)

Of course we could improve Disarchive and the database, but it seems to
me that we already have enough to improve the situation.  WDYT?
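
In pseudo-code, with every name made up (none of this exists in
(guix download) today):

    (define (fetch-tarball url expected-hash)
      ;; Pseudo-code: try the usual download path first, then fall
      ;; back to reconstructing the tarball with Disarchive.
      (or (try-download url expected-hash)            ;made-up
          (and=> (disarchive-db-lookup expected-hash) ;made-up
                 disarchive-assemble)))               ;made-up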

> Even with the code I have so far, I have a lot of questions.  Mainly I’m
> worried about keeping everything working into the future.  It would be
> easy to make incompatible changes.  A lot of care would have to be
> taken.  Of course, keeping a Guix commit and a Disarchive commit might
> be enough to make any assembling reproducible, but there’s a
> chicken-and-egg problem there.

The way I see it, Guix would always look up tarballs in the HEAD of the
database (no need to pick a specific commit).  The worst that could
happen is that we reconstruct a tarball that doesn’t match, in which
case the daemon errors out.

Regarding future-proofing, I think we must be super careful about the
file formats (the sexps).  You did pay attention to not having implicit
defaults, which is perfect.  Perhaps one thing to change (or perhaps
it’s already there) is support for other hashes in those sexps: both
hash algorithms and directory hash methods (SWH dir/Git tree, nar, Git
tree with a different hash algorithm, IPFS CID, etc.), as well as the
ability to specify several hashes.

That way we could “refresh” the database anytime by adding the hash du
jour for already-present tarballs.
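
Something like this, say (purely hypothetical syntax, with placeholder
values):

    (digest
      (sha256 "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")
      (sha512 "<added by a later refresh>")
      (git-tree-sha1 "<likewise>"))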

> What if a tarball from the closure of one of the derivations is
> missing?  I guess you could work around it, but it would be tricky.

Well, more generally, we’ll have to monitor archive coverage.  But I
don’t think the issue is specific to this method.

>> Anyhow, we should team up with fellow NixOS and SWH hackers to address
>> this, and with developers of other distros as well—this problem is not
>> just that of the functional deployment geeks, is it?
>
> I could remove most of the Guix stuff so that it would be easy to
> package in Guix, Nix, Debian, etc.  Then, someone™ could write a service
> that consumes a “sources.json” file, adds the sources to a Disarchive
> database, and pushes everything to a Git repo.  I guess everyone who
> cares has to produce a “sources.json” file anyway, so it will be very
> little extra work.  Other stuff like changing the serialization format
> to JSON would be pretty easy, too.  I’m not well connected to these
> other projects, mind you, so I’m not really sure how to reach out.

If you feel like it, you’re welcome to point them to your work in the
discussion at <https://forge.softwareheritage.org/T2430>.  There’s one
person from NixOS (lewo) participating in the discussion and I’m sure
they’d be interested.  Perhaps they can tell us whether they care about
having it available as JSON.
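
(For context, the ‘sources.json’ we publish looks roughly like this, if
memory serves; the integrity value here is a placeholder:)

    { "version": "1",
      "sources": [
        { "type": "url",
          "urls": ["https://ftp.gnu.org/gnu/hello/hello-2.10.tar.gz"],
          "integrity": "sha256-<base64 digest>" } ] }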

> Sorry about the big mess of code and ideas – I realize I may have taken
> the “do-ocracy” approach a little far here.  :)  Even if this is not
> “the” solution, hopefully it’s useful for discussion!

You did great!  I had a very rough sketch and you did the real thing,
that’s just awesome.  :-)

Thanks a lot!

Ludo’.




