Re: [Help-tar] Reproducibility of tar archives

From: Jakob Bohm
Subject: Re: [Help-tar] Reproducibility of tar archives
Date: Mon, 1 Apr 2019 23:56:06 +0200
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

On 01/04/2019 22:15, Yann E. MORIN wrote:
Jakob, All,

On 2019-04-01 12:12 +0200, Jakob Bohm spake thusly:
On 31/03/2019 14:08, Yann E. MORIN wrote:
So, here's my question: starting with tar-1.32 (the latest release as of
today), is the gnu tar format considered stable now, or is there no
guarantee about the gnu tar format stability?

For reference, here's how we generate the archives:

     tar cf - \
         --numeric-owner --owner=0 --group=0 --mtime="${date}" \
         --format=gnu -T "list.sorted" >"${output}.tar"

Can we expect this to be reproducible with future tar releases?
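
As an illustration of the command above, here is a minimal sketch that builds the same archive twice with a single tar binary and compares the hashes. The sample tree, the fixed date and the `make_archive` helper are assumptions made for the example, not part of Buildroot. It only checks reproducibility within one tar version, which is the easy case; the question in this thread is whether the same holds across versions.

```shell
#!/bin/sh
# Sketch: build the same archive twice with one tar binary and compare hashes.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Hypothetical sample tree standing in for a real source checkout.
mkdir -p src/sub
printf 'hello\n' > src/a.txt
printf 'world\n' > src/sub/b.txt

# Sorted list of regular files; LC_ALL=C keeps the order locale-independent.
find src -type f | LC_ALL=C sort > list.sorted

# Assumed fixed timestamp, so the on-disk mtimes do not leak into the archive.
date="2019-04-01 00:00:00 +0000"

make_archive() {
    output="$1"
    tar cf - \
        --numeric-owner --owner=0 --group=0 --mtime="${date}" \
        --format=gnu -T "list.sorted" >"${output}.tar"
}

make_archive first
touch src/a.txt          # change the on-disk mtime; --mtime should mask it
make_archive second

hash1=$(sha256sum first.tar | cut -d' ' -f1)
hash2=$(sha256sum second.tar | cut -d' ' -f1)

if [ "$hash1" = "$hash2" ]; then
    echo "archives are identical"
else
    echo "archives differ"
fi
```

Running the same script under two different tar releases and comparing the printed hashes by hand is one way to observe the format drift discussed here.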

As a more general solution for others in a similar predicament, could
GNU tar add the ability to explicitly request the formats produced by
earlier versions, for example by adding options such as
--format=gnu1.27 and --format=gnu1.30 (named for the versions that
first introduced the specific format changes, with a view to add new
ones as future changes are introduced).

Since we can't predict what the future will be made of, I find it
interesting to be indeed able to specify exactly what version of the
format to use, because as it is, --format=gnu means different things
with different tar versions, so they are essentially different formats.

So yes, I like this proposal.

Alternatively, could the Buildroot and GNU tar teams check if one of
the historic formats already explicitly supported by the --format
option provides the required stability.

Fact is, older formats that are "stable" are not all capable of storing
the necessary information, like filenames or paths > 100 chars, or
extended attributes and so on...

Either way, the difference is between two interpretations of the
--format option: A. Restrict the output to headers that are
understood by specific old/3rd party unpackers.  B. Reproduce a
very specific output, including how tar chooses between seemingly
equivalent header types, ignored values etc.  This includes
bugward compatibility with historic tar output bugs that made the
wrong choices.

It is not so much about older unpackers to understand the format: older
tar versions _are_ able to extract tarballs created with 1.30-onward.

Rather, it's that archives made with older tar versions can't be
reproduced with newer tar versions, because, as you very nicely
pointed out, they really generate another format.

I was not stating that you didn't need reproducibility.  I was stating
that different uses of tar would naturally expect different meanings of
the option.

The 3rd option, consistent with how reproducible builds are
otherwise done, is to treat tar as part of the tool chain, thus
making the exact build or source version of tar part of the list
of exact tool versions needed to reproduce a specific build (just
like there is already a requirement to use exact versions of gcc,
autotools etc.).  Doing so would also allow the historic hash values
to remain valid, as they are each tied to the tar version they were
historically built with.
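
A minimal sketch of what pinning tar could look like in a build script follows. The function names and the pinned version 1.29 are assumptions chosen for illustration; a real build system would take the pin from its own list of tool versions.

```shell
#!/bin/sh
# Sketch: refuse to build archives unless tar matches a pinned version.

# Extract the version number from the first line of `tar --version` output,
# e.g. "tar (GNU tar) 1.29" -> "1.29". Reads the text on stdin.
parse_tar_version() {
    sed -n '1s/.* //p'
}

# Succeed only if the tar found on PATH reports exactly the pinned version.
check_tar_version() {
    pinned="$1"
    actual=$(tar --version | parse_tar_version)
    if [ "$actual" != "$pinned" ]; then
        echo "tar ${actual} found, but archives are pinned to tar ${pinned}" >&2
        return 1
    fi
}

# Example use, with 1.29 as an assumed pin:
#   check_tar_version 1.29 || exit 1
```

The check compares version strings exactly rather than accepting a range, since any release may change the bytes written.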

The problem is that today, Buildroot uses tar-1.29, so all hashes are
generated with that "gnu-1.29" format, and they eventually percolate to
our source mirror (aka backup):

The problem that I don't understand is this:

In which situations does Buildroot recreate a tar file that doesn't
contain built/generated files and expect it to be exactly the same tar
archive as a different build configuration?

Do those situations incorporate other computed files, such as the
result of running autotools on a file in an upstream
source?  If so, the generated tar content already depends on the
versions of tools (such as autotools) used, and tar would belong to
the same version control as those tools.

Do those situations really need to recreate the tar file instead of
downloading it from and checking the hash?
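
The download-and-check alternative can be sketched as below, assuming the archive is already on disk and the expected SHA-256 is recorded somewhere alongside the package. The `verify_hash` helper and the file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: accept an existing archive only if its SHA-256 matches the recorded hash.

# verify_hash FILE EXPECTED_SHA256
# Exit status is 0 when the hashes match and non-zero otherwise.
verify_hash() {
    file="$1"
    expected="$2"
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Example use; the file name and hash would come from the build system's records:
#   verify_hash foo-1.0.tar "<recorded sha256>" || echo "hash mismatch" >&2
```

With this approach the archive is never regenerated, so the tar version that created it no longer matters to the consumer.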

Tomorrow, we update Buildroot to use, say, tar-1.32. All existing
archives are to be done again, because their hashes do not match. And
thus the newer archives would eventually replace the old ones. And then
older builds could no longer use those new archives, because they would
not match the old hashes...

That's why having a stable format is very important: we can generate
archives at various points in time, and be able to reuse them later as
they use the same scheme.

I see the point of having tar part of the toolchain, but that means that
the source archives can no longer be shared between builds; they
actually become artifacts of the build rather than the source...


Jakob Bohm, CIO, Partner, WiseMo A/S.
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
