bug#42162: Recovering source tarballs
From: zimoun
Subject: bug#42162: Recovering source tarballs
Date: Thu, 27 Aug 2020 11:41:24 +0200
Hi,
On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet@ngyro.com> wrote:
> zimoun <zimon.toutoune@gmail.com> writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-of-the-envelope estimate leads to ~1.2GB of
>> metadata for ~14k packages and then an increase of ~700MB per year,
>> both with Ludo’s code [1].
>>
>> [1] <http://issues.guix.gnu.org/issue/42162#11>
>
> It’s a good question. A good part of the size comes from the
> representation rather than the data. Compression helps a lot here. I
> have a database of 3,912 packages. It’s 295M uncompressed (which is a
> little better than your estimation). If I pass each file through Lzip,
> it shrinks down to 60M. That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K). I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.
Thank you for these numbers. Really interesting!
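As a quick sanity check of the per-package figures, here is a small
sketch. It assumes that “M” and “K” mean MiB and KiB, which the message
does not state; the names used are only illustrative.

```python
# Figures quoted above: 3,912 packages, 295M uncompressed, 60M after Lzip.
# Assumption: "M" is MiB and "K" is KiB (the thread does not say which).
PACKAGES = 3912
UNCOMPRESSED_MIB = 295
COMPRESSED_MIB = 60


def kib_per_package(total_mib, packages):
    """Average metadata size per package, in KiB."""
    return total_mib * 1024 / packages


uncompressed = kib_per_package(UNCOMPRESSED_MIB, PACKAGES)
compressed = kib_per_package(COMPRESSED_MIB, PACKAGES)
ratio = UNCOMPRESSED_MIB / COMPRESSED_MIB

print(f"uncompressed: {uncompressed:.1f} KiB per package")
print(f"compressed:   {compressed:.1f} KiB per package")
print(f"compression ratio: {ratio:.1f}x")
```

Under these assumptions the compressed figure comes out between 15 and
16 KiB per package, in line with the 15.5K quoted above.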
First, I do not know whether the database needs to be stored in Git. What
would be the advantage? (naive question :-))
On SWH T2430 [1], you explain the “default-header” trick to cut down the
size. Nice!
Moreover, the format is a long list, e.g.,
--8<---------------cut here---------------start------------->8---
(headers
 ((name "raptor2-2.0.15/")
  (mode 493)
  (mtime 1414909500)
  (chksum 4225)
  (typeflag 53))
 ((name "raptor2-2.0.15/build/")
  (mode 493)
  (mtime 1414909497)
  (chksum 4797)
  (typeflag 53))
 ((name "raptor2-2.0.15/build/ltversion.m4")
  (size 690)
  (mtime 1414908273)
  (chksum 5958))
 […])
--8<---------------cut here---------------end--------------->8---
which is human-readable. Is it useful?
Instead, one could imagine shorter keywords:
((na "raptor2-2.0.15/")
 (mo 493)
 (mt 1414909500)
 (ch 4225)
 (ty 53))
which, using your database (commit fc50927), reduces the size from 295MB
to 279MB.
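To make the saving concrete, here is a minimal sketch that serializes
one header with long and with short keywords and compares the lengths.
The `SHORT` table and the helpers `to_sexp` and `shorten` are
hypothetical, not part of the database code, and the short form "si" for
"size" is my assumption since the example above has no size field.

```python
# Hypothetical mapping from the long keywords to two-letter ones.
# "si" for "size" is an assumption; the others follow the example above.
SHORT = {"name": "na", "mode": "mo", "mtime": "mt",
         "chksum": "ch", "typeflag": "ty", "size": "si"}


def to_sexp(header):
    """Serialize a list of (keyword, value) pairs as one S-expression."""
    fields = []
    for key, value in header:
        # Strings are quoted (no escaping, enough for this example),
        # integers are written as-is.
        text = f'"{value}"' if isinstance(value, str) else str(value)
        fields.append(f"({key} {text})")
    return "(" + " ".join(fields) + ")"


def shorten(header):
    """Return the same header with every keyword replaced by its short form."""
    return [(SHORT[key], value) for key, value in header]


header = [("name", "raptor2-2.0.15/"), ("mode", 493),
          ("mtime", 1414909500), ("chksum", 4225), ("typeflag", 53)]

long_form = to_sexp(header)
short_form = to_sexp(shorten(header))
saved = len(long_form) - len(short_form)

print(short_form)
print(f"{len(long_form)} -> {len(short_form)} characters ({saved} saved)")
```

The saving is only a few characters per field, which fits the modest
overall gain reported above.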
Or even a plain list:
(\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
(\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)
where the first element gives the “type” of the list, which makes the
reader's job easier.
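A reader for such tagged lists could look like the sketch below. The
`LAYOUTS` table and the `decode` helper are hypothetical; the tags 0 and
1 stand for the \x00 and \x01 markers of the two example lines.

```python
# Hypothetical tag -> field layout table. The tags 0 and 1 mirror the
# \x00 and \x01 markers in the example; the table itself is an assumption.
LAYOUTS = {
    0: ("name", "mode", "mtime", "chksum", "typeflag"),
    1: ("name", "size", "mtime", "chksum"),
}


def decode(record):
    """Turn a tagged positional record into a keyword -> value dict."""
    tag, *values = record
    fields = LAYOUTS[tag]
    if len(values) != len(fields):
        raise ValueError(
            f"tag {tag} expects {len(fields)} values, got {len(values)}")
    return dict(zip(fields, values))


records = [
    (0, "raptor2-2.0.15/", 493, 1414909500, 4225, 53),
    (1, "raptor2-2.0.15/build/ltversion.m4", 690, 1414908273, 5958),
]

for record in records:
    print(decode(record))
```

The keywords are then stored once, in the layout table, instead of once
per header; the price is that a record is no longer self-describing
without that table.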
Well, the two naive questions are: does it make sense to
- have the database stored under Git?
- have a human-readable format?
Thank you again for pushing forward this topic. :-)
All the best,
simon
[1] https://forge.softwareheritage.org/T2430#47522