Re: Disarchive database synchronization

From: Timothy Sample
Subject: Re: Disarchive database synchronization
Date: Sat, 18 Mar 2023 13:49:34 -0600
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)

Hey Ludo,

Ludovic Courtès writes:

> I copied over the 12K entries that were missing from
>  (Note that there are currently only two copies
> of the database: one at/in [bB]erlin, and one at/in [Bb]ordeaux.)
> now weighs in at 1.8 GiB for 31,839 entries.

Wow – 12K!  For some reason I thought it would be fewer.  It’s very good
that we (finally) sync’d up the databases.

Also, my set is now at 31,821 after collecting the runoff from the
latest Preservation of Guix Report.  That’s shockingly close to the
31,839 you have.
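
Close counts do not by themselves mean the two sets hold the same
entries, so a name-by-name comparison is the safer check.  Here is a
minimal sketch, assuming each copy of the database is a flat directory
of files named after their hash (as the paths quoted below suggest).
The function names are mine and purely illustrative.

```python
from pathlib import Path


def entry_names(directory):
    """Return the set of file names found directly inside `directory`."""
    return {p.name for p in Path(directory).iterdir() if p.is_file()}


def compare_databases(dir_a, dir_b):
    """Return (only_in_a, only_in_b) as sorted lists of entry names.

    Assumption: an entry is identified by its file name alone, so two
    files with the same name count as the same entry even if their
    bytes differ.
    """
    a, b = entry_names(dir_a), entry_names(dir_b)
    return sorted(a - b), sorted(b - a)
```

Calling `compare_databases("mine", "yours")` (hypothetical directory
names) would list what each side still needs to copy from the other.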

> For the remaining entries, it’s trickier.  Sometimes it’s just the
> gzip compression parameters that differ, which could be addressed with a
> little bit more work:
> $ file ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz
> ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz: gzip compressed data, max compression, from Unix, original size modulo 2^32 446731
> ../../disarchive/sha256/ffdc77f5e5cb2390b9309de63eb7be68d9fe631e898f4da6c04a8159daefc2c0.gz: gzip compressed data, max speed, from Unix, original size modulo 2^32 446731

I’m not sure getting the compressed files to match matters.  Disarchive
cares a lot about that when it comes to source code tarballs, because
everybody signs and computes checksums over the compressed versions.
However, for these files, the differences introduced by compression can
be ignored.
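
To make that concrete, here is a minimal sketch of a comparison that
looks only at the decompressed contents, so that files differing in
compression level (as in the "max compression" versus "max speed" case
above) still count as equal.  It uses only the Python standard library;
the function names are illustrative and not part of Disarchive.

```python
import gzip
import hashlib


def payload_sha256(path):
    """Return the SHA-256 hex digest of the decompressed contents of a gzip file."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as stream:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: stream.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_payload(path_a, path_b):
    """True if both gzip files decompress to identical bytes."""
    return payload_sha256(path_a) == payload_sha256(path_b)
```

Two files that pass `same_payload` may still differ byte for byte on
disk, since the gzip header and the compressed stream depend on the
compressor settings.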

> Sometimes it’s trickier:
> # diff -u <(gunzip -d < 0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz) <(gunzip -d < ../../disarchive/sha256/0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9.gz)
> --- /dev/fd/63  2023-03-14 16:13:21.635733426 +0100
> +++ /dev/fd/62  2023-03-14 16:13:21.635733426 +0100
> @@ -1,7 +1,7 @@
>  (disarchive
>    (version 0)
>    (gzip-member
> -    (name "webview-sys-0.6.2.tar.gz")
> +    (name "rust-webview-sys-0.6.2.tar.gz")
>      (digest
>        (sha256
>          "0001f025c1425ffe36270a81cb091eade87dd8d29ac773735ae47e1a8c8066c9"))
> @@ -13,7 +13,7 @@
>      (footer (crc 1807070134) (isize 121344))
>      (compressor zlib-best)
>      (input (tarball
> -             (name "webview-sys-0.6.2.tar")
> +             (name "rust-webview-sys-0.6.2.tar")
>               (digest
>                 (sha256
>                   "4fb18f3206838e11f7f8caba6fad9e0f796109428b502793b9f2f0613fe0f275"))
> @@ -78,7 +78,7 @@
>               (padding 0)
>               (input (directory-ref
>                        (version 0)
> -                      (name "webview-sys-0.6.2")
> +                      (name "rust-webview-sys-0.6.2")
>                        (addresses
>                          (swhid "swh:1:dir:fa41df38bf639ada28c900b0915661e787fe6d15"))
>                        (digest

The name field is not used for data reconstruction.  It’s for human
consumption (and it may have made some early examples of use at the
command line easier to explain).  Here, the difference arises because
Crate URIs are weird, and the Preservation of Guix code does not keep
the origin file name.  Hence, the PoG version extracts the Crate name
alone from the URI, while the Cuirass version uses the Guix package
name, which carries the “rust-” prefix.
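
Since those names are only labels, two specifications can be compared
with the labels blanked out.  Below is a rough sketch that works on the
decompressed text.  It assumes the only labels to ignore are the three
seen in the diff above (directly inside gzip-member, tarball, and
directory-ref after its version field).  It is a plain regular
expression, not a real S-expression parser, and the function names are
made up for illustration.

```python
import re

# Matches a (name "...") form only where it directly follows one of the
# three openers seen in the quoted diff.  Other (name ...) forms, if a
# specification contains any, are deliberately left untouched.
LABEL = re.compile(
    r'(\((?:gzip-member|tarball)\s+|\(directory-ref\s+\(version\s+\d+\)\s+)'
    r'\(name\s+"(?:[^"\\]|\\.)*"\)'
)


def blank_labels(spec_text):
    """Replace the human-facing name labels with an empty string."""
    return LABEL.sub(lambda m: m.group(1) + '(name "")', spec_text)


def equal_modulo_labels(spec_a, spec_b):
    """True if two specification texts differ only in those labels."""
    return blank_labels(spec_a) == blank_labels(spec_b)
```

With this, the "webview-sys" and "rust-webview-sys" variants shown
above would compare equal, while any change to a digest or SWHID would
still show up as a difference.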

> As Tim pointed out, Disarchive disassembly is not fully deterministic
> and/or might change a bit over time as Disarchive evolves, and that’s
> prolly what we’re seeing here.

I honestly think this is a good thing.  My instincts tell me that we
should excise all sources of ambiguity, like we’re trying to do in the
big picture.  However, Disarchive will get better at describing things
over time.  For instance, it doesn’t handle tar extension headers
elegantly at the moment.  In the future, if I fix this, I might consider
creating a “migrate” feature that improves existing specifications
(e.g., converting the old, verbose representation of extension headers
into the new representation).  In particular, I’ve left some warts in
the software in order to ship it, and I would be sad to try and commit
to those for the rest of time!

We might also add other resolver addresses besides SWHIDs....

Maybe I’m missing some perspective, but I don’t think trying to commit
to reproducible outputs for Disarchive makes sense.

-- Tim

P.S., we’ll have to do this dance again shortly, as I just computed
2,023 historical bzip2 specifications.  They’re not online yet, but
they’ll be up when I publish the next PoG report – which should take less
than a year this time!  :p
