gnu-arch-users

Re: [Gnu-arch-users] archive storage format comments on the size


From: Andrea Arcangeli
Subject: Re: [Gnu-arch-users] archive storage format comments on the size
Date: Tue, 30 Sep 2003 00:39:24 +0200
User-agent: Mutt/1.4.1i

On Mon, Sep 29, 2003 at 02:15:50PM -0700, Tom Lord wrote:
> 
> 
> 
>     > From: Andrea Arcangeli <address@hidden>
> 
>     > I tried to convert two random cvs repositories to arch archives and the
>     > sizes are like this:  [  (4.8..5.8)x larger for arch ]
> 
> Since cscvs is still in the waning days of "experimental", and since
> you're fairly new to arch, it's possible that _part_ of the difference
> is just something silly in this particular conversion.  (A message
> shortly after yours even suggests a possibly relevant glitch in
> cscvs.)
> 
> However, in general, yup: the base archive size of arch is a bit
> larger than CVS.
> 
> Comparing to CVS on archive size is something to be done carefully,
> for a few reasons.
> 
> *) Think from first principles, instead.
> 
>    The theoretical minimum of information in a whole tree commit is
>    something like the contents of added lines and files, the addresses
>    of added and deleted lines, the log message, and enough of an
>    inventory index list to figure out the tree-deltas, plus enough 
>    structuring goo to make those components indexable.
> 
>    The space inefficiencies in arch are that it adds: contents of
>    deleted lines and files, context of diffs, an extra copy of 
>    the log file, and some overhead costs associated with using

Why can't the not-strictly-needed stuff be removed? We know the
patchsets can never reject during checkout, so why should we carry all
this overhead with us when it can be deduced at runtime? Is it feasible
to regenerate it at runtime? I was thinking of starting with diff -u0.
I believe the only case where we need a larger diff is when we do a
merge (the -u parameter effectively controls how strict the patch
application should be, and -u3 is a good default there); we never need
anything bigger than -u0 during checkouts, and the merging isn't that
performance critical.
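To make the context overhead concrete, here's a quick sketch (the file
names and the edit are invented; only diff's context behavior is being
illustrated) comparing the size of the same one-line change expressed
with -u0 and with the usual -u3:

```shell
# Invented example files, standing in for one file in a patchset.
seq 1 200 > before.txt                             # 200 numbered lines
sed 's/^100$/one hundred/' before.txt > after.txt  # change exactly one line

diff -U0 before.txt after.txt > patch-u0.diff || true  # diff exits 1 when files differ
diff -U3 before.txt after.txt > patch-u3.diff || true

wc -c patch-u0.diff patch-u3.diff  # -U3 carries 6 extra context lines per hunk
```

For a single-line hunk, the -u3 form drags along six lines of context
that a checkout never needs, since archive patches apply to pristine
trees.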

>    directory structure (including in tar files) as the "structuring
>    goo" instead of something smaller.    That's not all that much 
>    and each element represents a reasonable space/time trade-off.
> 
>    Gzip is pretty fast as it's used here;   tar is decent at
>    compressing the use of directories as structuring goo;  as you
>    noted, gzip is beneficial for network traffic:  these factors

If you want to do a really efficient network transfer you have to
ungzip first anyway, but at the moment it's certainly a good trade-off
for simplicity of implementation. And the log is already stored
uncompressed, so it shouldn't be duplicated; just access the
uncompressed copy instead. I'd like to shrink the patchset format as
much as possible and leave it uncompressed, then add some heuristic to
pack the old uncompressed patchsets in the archive into one big tar.gz,
which will compress 20x, not 2x. Reading the whole big archive and
uncompressing it in memory should be much more efficient in performance
too (unless you want a single patchset and not a full checkout, but
that's not a common operation, unlike a full checkout).
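A rough sketch of the packing idea (directory and file names invented;
real patchsets would stand in for the stub files): compressing many
small, similar files one by one versus gzipping a single tar that
contains them all, so the compressor can exploit the redundancy between
patchsets:

```shell
# 50 small, similar files, standing in for a run of patchsets
mkdir -p patchsets
for i in $(seq 1 50); do
    { echo "patchset $i"; seq 1 100; } > "patchsets/patch-$i.txt"
done

# Current scheme: each patchset compressed on its own
for f in patchsets/*.txt; do gzip -9 -c "$f" > "$f.gz"; done
cat patchsets/*.gz | wc -c        # total size of the individual .gz files

# Proposed scheme: one big tar.gz over all of them
tar czf patchsets.tar.gz patchsets/*.txt
wc -c < patchsets.tar.gz          # smaller, since gzip sees cross-file redundancy
```

The per-file gzip headers and the lost cross-file redundancy are what
keep the individual scheme stuck near 2x instead of 20x.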

>    offset some of the space inefficiencies.
> 
>    Storing an archive on an ext* filesystem is a slight negative
>    offset compared to, say, an ffs filesystem because of the poor
>    space performance on small files.

this isn't a big deal actually; I measured that too. A gzip -d of the
3M .tar.gz generated a 65M tar file, which means that even after
removing the filesystem overhead the uncompressed size was still over
65M: there was no space wasted by the fs inside the tar file. Also, if
somebody really cares about the unused partial blocks, reiserfs
provides tail packing on Linux too (and it's preferable anyway when
dealing with large directories of small files, like an arch archive).
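For reference, the per-file block overhead is easy to measure directly
with GNU du (the directory name is invented; an arch archive directory
would be the real target):

```shell
# 100 tiny files, like the small metadata files in an archive
mkdir -p smallfiles
for i in $(seq 1 100); do echo "tiny file $i" > "smallfiles/f$i"; done

du -sB1 smallfiles                  # bytes actually allocated on disk
du -sB1 --apparent-size smallfiles  # sum of file sizes, i.e. what tar sees
```

On ext* the allocated figure is dominated by one block per file;
tail-packing filesystems close most of that gap.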

>    Space consumption, measured as a function of "labor performed on
>    the project" is (taking your figures as an outlier) probably
>    something like 2-6x greater than CVS. (A rough guess.)

Let's stick to the uncompressed diffs only, for simplicity: you have
one diff per file in each patchset. So, in terms of the amount of data
generated at each commit, it shouldn't matter whether these diffs are
appended to the end of a ,v file or stored in a new file (ignoring the
fs overhead, which can be solved in different ways). The other metadata
should always be fairly small. It's just not clear why it's so much
bigger in uncompressed form; I feel like something that should be small
isn't small. Certainly the -u3 context explains some of it.

> *) Think in economic terms, both current and historical.
> 
>    Currently, a, say, 10x inefficiency compared to CVS is, frankly,
>    not much money.  The very painful exception is people in the same
>    situation as me: people who have some old machine and don't have
>    money to buy a new disk -- thus live with N+1-year old storage
>    economics.  I'm pretty active on a fairly large amount of source; I
>    work with a decent number of other people's archives.  I mirror
>    lots of stuff; keep a full revlib of my own stuff and partial
>    revlib of one or two contributors.  When I started arch the
>    pessimistic end of my projection was that I would risk filling up
>    my disk roughly around now -- In fact, I'm only 1/3 of the way
>    there.
> 
>    CVS makes certain space/time trade-offs differently.   It uses a
>    storage model designed -- what, more than 20 years ago?   Measured
>    in dollars, the space consumption differences between CVS and arch
>    have fallen, during that period, by a few orders of magnitude.
>    A-buck-a-gig-and-falling is a very different world from that in
>    which CVS was designed.   (Tagging and branching are examples of
>    operations in which the space/time trade-offs of CVS become quite
>    apparent -- not to mention the associated robustness issues.)
> 
>    This is why you frequently see comments in arch-world like "mirror
>    stuff", "build revlibs", "build an ancillary index for that".

You can certainly solve problems by throwing money at the hardware;
these days storage is cheaper than it has ever been. But I don't
normally take that as a good argument while developing software (unless
the problem isn't visible, or the additional overhead pays off in other
common operations or features).

Also note that the repository I converted is a very tiny one, nowhere
near comparable to the kernel. The kernel has tens of thousands of
commits, some of them huge (lots of context lines with -u3, many
hunks).

>     > at around >2000 patchsets arch slows down to a crawl in the commits,
>     > it seems to spend all its time doing a flood of lstat on every patchset
>     > file in the {arch} directory inside the working dir, basically it's O(N)
>     > where N is the number of patchsets in the archive. cscvs spends all its
>     > time in wait4 (waiting tla to return). Note, linux should be fairly
>     > efficient in the stats, thanks to the dcache, independently of the fs
>     > used.
> 
> You are, here, measuring cscvs performance -- not tla.   And you have
> a slight misunderstanding of patch logs.
> 
> Mass-conversion by cscvs is fundamentally a crunch-n-grind batch
> process that's not typical of how arch is used.   It's a "make

oh sure, I know commit isn't performance critical (nowhere near as
much as a checkout). I'm not saying the O(N) complexity is a
showstopper, I'm just saying the commits will slow down over time when
they probably shouldn't. But for a human, whether the wait grows from
1 millisecond to 2 seconds over time probably doesn't matter.

> possible" case not a common case.   If it became a common case, the
> best solution might very well be for cscvs to write an archive
> directly -- bypassing tla entirely -- not fussing with project trees
> -- computing changesets directly from the CVS data.   Tla is nice this
> way:  that'd be a tractable problem.

I believe the current way of creating the archive via tla is certainly
preferable; the effort of creating the archive directly is surely not
worthwhile.

I understand you have to stat all the files in the tree (I don't want
to tla edit), but I don't see why you have to stat all the internal
_patchset_ metadata inside the {arch} directory.

I mean, the commit has to be O(N), but IMHO N should be the number of
files in the tree (outside the {arch} directory), not the number of
patchsets inside the {arch} directory. Probably there's a good reason
for stat'ing all the patchsets; I just don't see it.

Don't get me wrong, I'm not saying those are showstoppers, but if they
can be fixed they'd sure better be fixed. And no, it's not urgent at
all.

> In terms of operations that arch-users typically perform during
> day-to-day work: speed has been getting better and better.   For most
> trees, it's pretty damn good at the moment.   For very large trees
> using explicit tagging -- some of the latest optimizations haven't
> caught up with that yet but probably will in the next 2-3 months.

Cool! And btw, I'm using explicit tagging, obviously: I want the
strict commits, and the pollution would provide no benefit to me since
I _need_ the strict commits anyway. Also btw, my strict commits are
really implemented the way shown below (not as described in the other
email, which forbade me from checking in while there was garbage in
the tree; I want the garbage to be ignored, but that email still
pointed me in the right direction):

explicit
untagged-source junk

In fact I would like commit to shut up completely about the junk; I
only want to see the "known" tagged files shown during the commit
(otherwise the commit output gets totally polluted when there are tons
of junk files). But that's a cosmetic issue: explicit plus
untagged-source junk seems to work great!

The reason I treat it as junk is that I definitely want to be able to
commit even with tons of untagged files in the tree. The ideal way to
work is to keep the tree compiled and to be able to edit it and commit
over time, without ever doing a distclean, so that a new compilation
to test the new changes is as quick as possible.

> I've made some suggestions recently of what hacks to contribute to
> accelerate that process.

Is diff -u0 one of the ideas that will be implemented?

> Patch-logs do not have to retain every entry from all of the history
> of a tree.   They need only retain a subset for the "active region" of
> history.   (In those rare cases where you need to start merging from
> very old work, if your logs have been pruned since then, sync-tree is
> your friend.)


thanks a lot,
Andrea - If you prefer relying on open source software, check these links:
            rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/
            http://www.cobite.com/cvsps/



