Re: intrinsic vs extrinsic identifier: toward more robustness?

guix-devel

From:	Simon Tournier
Subject:	Re: intrinsic vs extrinsic identifier: toward more robustness?
Date:	Sun, 05 Mar 2023 21:21:18 +0100

Hi Maxime,

Thanks for your comments.

On Sat, 04 Mar 2023 at 01:08, Maxime Devos <maximedevos@telenet.be> wrote:

> To my understanding, there is only one 'real' identifier in Guix: the 
> (sha256sum (base32 ...)) (*).  Those other identifiers like the URL in 
> url-fetch and git-fetch are just hints on where to find the object -- 
> very important hints without which finding the object is much more 
> likely to fail, but just hints nonetheless.

I am not sure I understand what you mean by “hint”.  I would not call
URLs “just hints on where to find the object”.

NAR+SHA256 is only the ’real’ identifier when you allow
substitutes.  Otherwise, Guix fetches using the ’uri’ field of the
’origin’.  And that’s the scenario I am envisioning here: for whatever
reason, all the data in the Bordeaux and Berlin stores is gone, and then
it is a hard time for “guix time-machine”.

>> Intrinsic identifier also relies on a (trusted) map but collisions are
>> avoided as much as possible.  Somehow it strongly reduces the power of
>> the authority and it is often more robust.
>
> Who is 'the authority' here, how does the absence of collision reduce 
> the power of the authority, and what is your point about reducing the 
> power of the authority?

With an intrinsic identifier, the “authority” is somehow the data
itself.  In content-addressed systems, the “authority” is diluted or
absent.
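
To make that concrete, here is a minimal sketch in Python (not Guix
code) of what “the data itself is the authority” means: anyone holding
the bytes can recompute the identifier without asking anybody.  The
function names are mine, and a plain SHA256 of the bytes stands in for
NAR+SHA256, which additionally serializes the file tree.  The
’swh:1:cnt’ form is the SWHID of a single file, which to my knowledge
reuses the Git blob hash.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Intrinsic identifier: depends only on the bytes, not on a location."""
    return hashlib.sha256(data).hexdigest()


def swhid_content(data: bytes) -> str:
    """SWHID of a single file ('cnt'), computed like a Git blob hash:
    SHA1 over the header 'blob <length>' + NUL byte, followed by the content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    content = b"hello\n"
    print(sha256_hex(content))
    print(swhid_content(content))
```

With an extrinsic identifier such as a URL, nothing in the identifier
itself lets you check that you got the right bytes; you have to trust
whoever maintains the map.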

>> Whatever the intrinsic identifier we consider – even ones based on very
>> weak cryptographic hash functions such as MD5, or based on non-cryptographic
>> hash functions such as Pearson hashing, etc. – the integrity check is
>> currently done by SHA256.
>
> How about using the hash of the integrity check as an intrinsic 
> identifier, like is done currently?  I mean, we hash it with 
> sha256 for the integrity check anyway, might as well reuse it.

Maybe ask the GNUnet folks to address content by NAR+SHA256 instead in
their specification. ;-)

Kidding aside, your comment raises two points of view:

 1. Guix is fetching data from elsewhere and this elsewhere is not using
    NAR+SHA256 as intrinsic identifier.  Therefore, the question is how
    to adapt the source origin to take this elsewhere into account?

 2. Replace the NAR+SHA256 integrity checksum by what content-addressed
    systems use as intrinsic identifier.  IMHO, that’s a bad idea for
    two reasons: (a) security, for instance SHA1 as used by SWH is not
    secure and (b) it will be unmanageable in practice.

>> All that’s said, Guix uses extrinsic identifiers for almost all origins,
>> if not all.  Even for ’git-fetch’ method.
>
> For git-fetch, the value of the 'commit' field is intrinsic (except when 
> it's a tag instead).

No, that is imprecise.  The exception is *not* a tag label as the value
of the ’commit’ field; rather, the exception is a Git commit hash as the
value.

> This can be solved by placing the actual commit in the 'commit' field of 
> git-reference, instead of the tag name, then things are completely 
> unambiguous -- this and its opposite were discussed in ‘On raw strings 
> in <origin> commit field’ (*), IIRC.

The thread you are referencing [1] is based on misunderstandings.  I
would like to move forward, hence my detailed email. :-)

1: <https://yhetil.org/guix/6e451a878b749d4afb6eede9b476e5faabb0d609.camel@gmail.com/#r>

> (*) Also maybe that thread about tricking peer review.
>
> I didn't understand the position that commit field should contain the 
> (indirect, fragile) tag instead of the (direct, robust) commit, but 
> those differences could be sidestepped by having both a 'tag' field and 
> a 'commit' field, IIUC.

I would not frame it this way.  My view is not to replace something with
something else but to add one or several things.

> The problem then was to somehow map the NAR hash to the FS identifier.

Yes, that’s the problem. :-) The GNUnet FS identifier is one case.  And
my question here is: could we augment the source origin so that it can
deal with various identifiers?

> A straightforward solution would be to just replace the https:// by 
> gnunet:// in the origin (like in https://issues.guix.gnu.org/44199, 
> except that patch doesn't support fallbacks to other URLs like url-fetch 
> does).

Somehow, your proposal would be to have a list as URI, right?

     (origin
       (method gnunet-fetch)
       (uri
        (list
         (string-append "mirror://gnu/hello/hello-" version
                        ".tar.gz")
         "gnunet://fs/chk/TY48PGS5RVX643NT2B7GDNFCBT4DWG692PF4YNHERR96K6MSFRZ4ZWRPQ4KVKZV29MGRZTWAMY9ETTST4B6VFM47JR2JS5PWBTPVXB0.8A9HRYABJ7HDA7B0"
         "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
       (file-name "gnunet-hello-2.10.tar.gz")
       (sha256
        (base32
         "0ssi1wpaf7plaswqqjwigppsg5fyh99vdlb9kzl7c9lng89ndq1i")))
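
A rough sketch of how such a list could be consumed, in Python rather
than Guile and under assumptions of mine: ’fetch_with_fallback’ and the
’fetchers’ table are hypothetical, and a plain SHA256 over the bytes
stands in for the real integrity check.  Each location is tried in
order and the result is accepted only if it matches the single
integrity hash.

```python
import hashlib
from typing import Callable, Dict, Iterable, Optional

# A fetcher takes a URI and returns the bytes, or None when it fails.
Fetcher = Callable[[str], Optional[bytes]]


def scheme_of(uri: str) -> str:
    """Return the part before the first ':' ('mirror', 'gnunet', 'swh', ...)."""
    return uri.split(":", 1)[0]


def fetch_with_fallback(uris: Iterable[str],
                        fetchers: Dict[str, Fetcher],
                        expected_sha256: str) -> bytes:
    """Try each URI in order; accept the first result whose hash matches."""
    for uri in uris:
        fetch = fetchers.get(scheme_of(uri))
        if fetch is None:
            continue  # no backend for this kind of identifier
        data = fetch(uri)
        if data is None:
            continue  # location unavailable, try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data  # integrity check passed
    raise LookupError("no location provided data matching the expected hash")
```

The point is that the integrity hash stays the single arbiter, whatever
the kind of identifier used to locate the data.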

>> It is not affordable, neither wanted, to switch from the current
>> extrinsic identification to a complete intrinsic one.  Although it would
>> fix many issues. ;-)
>
> How about in-between: include both an intrinsic identifier (the 
> sha256sum) and an extrinsic identifier (the URLs to locate the object 
> at), like the status quo.

That’s what I am proposing between the lines. :-)

The question is which design.  For instance, it could go under the
’properties’ field, similarly to “upstream name” or potentially other
“metadata”.  Or it could go under the source origin field.

However, as you pointed out, going through ’properties’ would not be as
easy.  And as you also pointed out, the integrity field could be
something other than ’sha256’, so maybe we could have a list here.

>> The discussion could also fit how to distribute using ERIS.
>
> ERIS is not a method on its own; you need to combine it with a P2P 
> network that uses ERIS.  I do not understand the special focus on ERIS.

Yes, indeed.  However, to my knowledge, each P2P network can use its own
identifiers and, from my understanding, ERIS can rely on any P2P
network.  Therefore, wanting guix-daemon to be able to use ERIS somehow
implies a discussion about the identifiers used by the P2P networks.

Am I missing something?

>> At some point, I was thinking to have something like “guix freeze -m
>> manifest.scm” returning a map of all the sources from the deep bootstrap
>> to the leaf packages described in manifest.scm.  However, maybe
>> something is poor in the metadata we collect at package time.
>
> That sounds like "guix build --sources=transitive" to me, except for 
> being even more transitive.  I propose making this an additional option 
> for the --sources argument instead.

No.  “guix build --sources=transitive” returns an archive containing all
the sources.  Instead, I would like all the various identifiers (URL,
NAR, SWHID, GNUnet, etc.) of all the transitive sources.
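
Something along these lines, sketched in Python with a toy graph.
Everything here is hypothetical: ’freeze’ is not an existing command,
and the ’inputs’ and ’identifiers’ tables with their placeholder values
only illustrate the shape of the map I have in mind.

```python
from typing import Dict, List


def freeze(roots: List[str],
           inputs: Dict[str, List[str]],
           identifiers: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Walk the transitive inputs of ROOTS and collect, for each source,
    every identifier known for it (URL, NAR hash, SWHID, GNUnet, ...)."""
    seen = set()
    stack = list(roots)
    result = {}
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # shared inputs are visited only once
        seen.add(name)
        result[name] = identifiers.get(name, {})
        stack.extend(inputs.get(name, []))
    return result


# Toy data: a leaf package, one library, and a bootstrap source.
inputs = {"leaf": ["lib"], "lib": ["bootstrap"], "bootstrap": []}
identifiers = {
    "leaf": {"url": "<placeholder URL>", "nar-sha256": "<placeholder hash>"},
    "lib": {"url": "<placeholder URL>", "swhid": "<placeholder SWHID>"},
    "bootstrap": {"nar-sha256": "<placeholder hash>"},
}
frozen = freeze(["leaf"], inputs, identifiers)
```

Here ’frozen’ maps each of the three sources to its identifiers; no
source content is fetched or archived.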

Cheers,
simon

PS:

>> However the fields ’swhid’ and the other SHA256 ’digest’ are different
>> from above.  That’s because the dots [...] part.  It probably comes from
>> the normalization process. Well, I am not sure to deeply understand why
>> it is different but that’s another story. :-)
>
> The reason for the normalisation was something about SWH only providing 
> tarballs whose contents are equal to the ingested tarball; the tarballs 
> are not bit-for-bit identical to the ingested tarball.  But Guix needs 
> bit-for-bit identical tarballs, so Disarchive contains the information 
> that was stripped-out by SWH to complement the tarballs provided by 
> Disarchive.

SWH is not in the picture with the example I provided. :-)  Yes, the
dots part is related to some normalization and “metadata”.

What I do not understand is that, if the output of “guix build hello -S”
is manually decompressed and untarred, the content corresponds to:

    $ guix hash -S git -H sha256 -f hex hello-2.12.1
    cc7d5c45cfa1f5fba96c8b32d933734b24377a3c1ac776650044e497469affd4

The ’disarchive’ tool disassembles the compressed archive: it first
provides the hash of the compressed archive (.tar.gz), then stores
metadata about the compression level, algorithm, etc., then provides the
hash of the uncompressed archive (.tar), then stores metadata about the
files, and last provides the hash of the tree.  It reads,

    (input (directory-ref
             (version 0)
             (name "3dq55rw99wdc4g4wblz7xikc8a2jy7a3-hello-2.12.1")
             (addresses
               (swhid "swh:1:dir:9c1eecffa866f7cb9ffdd56c32ad0cecb11fcf2a"))
             (digest
               (sha256
                 "1cb6effd40736b441a2a6dd49e56b3dfd4f6550e8ae1a8ac34ed4b1674097bc0"))))))))

and I do not understand why it is not the same as the one computed
manually; see above.  Well, that’s a detail and not relevant to the
current discussion since it is part of how Disarchive works internally.
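
For what it is worth, the layering itself is easy to see with a small
Python sketch.  This is not Disarchive’s code or its format, just an
illustration, and the helper names are mine: the same tarball compressed
twice with different gzip header metadata gives different hashes for
the .tar.gz but the same hash for the .tar, which is why the compression
metadata has to be kept next to the hashes.

```python
import gzip
import hashlib
import io
import tarfile


def make_tar(files: dict) -> bytes:
    """Build an uncompressed tar archive in memory from {name: bytes}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, content in sorted(files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            info.mtime = 0  # fixed timestamp keeps the archive reproducible
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


tar_bytes = make_tar({"hello/README": b"hello\n"})

# Same content, but a different timestamp in the gzip header.
gz_a = gzip.compress(tar_bytes, mtime=0)
gz_b = gzip.compress(tar_bytes, mtime=1)

print("tar.gz (a):", sha256_hex(gz_a))
print("tar.gz (b):", sha256_hex(gz_b))
print("tar       :", sha256_hex(gzip.decompress(gz_a)))
```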
