Re: [Gnu-arch-users] archive storage format comments on the size


From: Tom Lord
Subject: Re: [Gnu-arch-users] archive storage format comments on the size
Date: Mon, 29 Sep 2003 17:46:07 -0700 (PDT)

    > From: Andrea Arcangeli <address@hidden>

    > [long message in which many questions are expressed as 
    >  speculative ideas for redesigning this or that.]

Speaking of compression, couldn't you (please) have expressed much
(not all, I know) of your message with something like:

    > [As nearly as I can see, using contextless and forward-only
    >  diffs would save a substantial amount of disk space in archives.
    >  Similarly, one could eliminate the extra copy of the log entry
    >  and the copies of removed files.  Since it is only important to
    >  optimize for `tla get', wouldn't it be better to do so?

    >  [report of various measurements]]

There were some other questions in your message too, but I'll leave
those for now.

I doubt that many experienced users would agree that `tla get' is the
only operation to optimize for.  Other operations which take advantage
of the current space/time trade-offs include `replay', `revisions',
`cat-archive-log', and `get-patch'.  Still other operations that
certainly _could_ take advantage of that space/time trade-off include
`update', `star-merge', `revdelta', and `deltapatch'.

Furthermore, in a general purpose changeset format, rather than one
just for archive storage, bidirectionality and context are certainly
vital.   So if the dumb-server archive format were to reject
bidirectionality and context, that would mean that it could not
re-use a general purpose changeset format.
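
To make the point about bidirectionality and context concrete, here is
a minimal sketch in Python.  It is not tla's actual changeset format:
the `Hunk' type and `apply_hunk' function are hypothetical names, and
the sketch assumes a single hunk per file.  Because the hunk records
the removed lines as well as the added ones, it can be applied in
either direction, and the context lines let the patcher refuse a tree
that does not match.  A forward-only, contextless diff would carry
only `new' and a position, so neither of those is possible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hunk:
    """Hypothetical bidirectional hunk with context (illustration only)."""
    start: int          # 0-based index of the first leading-context line
    before: List[str]   # leading context, unchanged by the patch
    old: List[str]      # lines removed when applying forward
    new: List[str]      # lines added when applying forward
    after: List[str]    # trailing context, unchanged by the patch


def apply_hunk(lines: List[str], hunk: Hunk, reverse: bool = False) -> List[str]:
    """Apply one hunk forward, or backward when reverse=True."""
    src, dst = (hunk.new, hunk.old) if reverse else (hunk.old, hunk.new)
    expected = hunk.before + src + hunk.after
    end = hunk.start + len(expected)
    # Context check: refuse to patch text that is not what the hunk expects.
    if lines[hunk.start:end] != expected:
        raise ValueError("context mismatch: hunk does not apply")
    return lines[:hunk.start] + hunk.before + dst + hunk.after + lines[end:]


if __name__ == "__main__":
    old_file = ["a", "b", "c", "d"]
    hunk = Hunk(start=0, before=["a"], old=["b"], new=["B", "B2"], after=["c"])
    new_file = apply_hunk(old_file, hunk)                 # forward
    restored = apply_hunk(new_file, hunk, reverse=True)   # backward
    print(new_file, restored == old_file)
```

The same stored data serves both directions, which is what lets a
general purpose changeset undo a change as well as replay it.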

I am uncertain of the usefulness (and even the meaning) of the
measurements you've offered.   In any event, each time I or someone
working with me has quantified some of these issues, over a fairly
broad sampling of archives, the result has at least subjectively been
that the current trade-offs are quite reasonable.   I'm not sure how
much more can be said about that without a larger design or deployment
context. 

It is perhaps worth observing that when optimizing for `get', we
already use client-side revlibs and server-side cachedrevs -- further
trading space for time -- and now some users are working out how to
add summary deltas: yet another space for time trade-off.   Generally,
the consensus is that, at these scales, time is far more precious than
space.  

An exception to that consensus is revision libraries, which are quite
suitable for many projects but clearly not suitable (when used as a
fully-populated library) for large trees managed by people with less
than the latest and greatest hardware.
That is why you can find threads on the list talking about
"interpolated diffs storage" for revision libraries, why the bug
tracker has a request for options to `library-add' to make sparse
libraries easier to manage, and why there's some discussion about
where and what sort of additional hooks to drop into the code to help
manage sparse libraries.  Of those solutions, all but interpolated
diffs are trivial changes -- and so they seem a good "fit" for an
economic situation in which, in a few years, most of the issues will
fade away.  Interpolated diffs are interesting because, regardless of
storage costs, they support a very fast implementation of `annotate'.
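
The space/time trade-off behind a sparse library can be sketched as
follows.  This is an illustration under stated assumptions, not tla's
actual algorithm: revisions are modelled as integer patch levels,
`plan_build' is a hypothetical helper, and changesets are assumed to
be applicable in either direction.  Given the patch levels already
present in the library, it picks the nearest one and reports how many
changesets must be applied, and in which direction, to reach the
target.

```python
from typing import Iterable, Tuple


def plan_build(populated: Iterable[int], target: int) -> Tuple[int, str, int]:
    """Pick the nearest populated patch level and the work needed from it.

    Returns (base, direction, steps).  Hypothetical helper for illustration.
    """
    levels = sorted(set(populated))
    if not levels:
        raise ValueError("library is empty: nothing to build from")
    # Nearest populated level; ties go to the lower patch level.
    base = min(levels, key=lambda r: (abs(r - target), r))
    steps = abs(target - base)
    if base < target:
        direction = "forward"
    elif base > target:
        direction = "reverse"
    else:
        direction = "none"
    return base, direction, steps


if __name__ == "__main__":
    sparse = {0, 10, 20}
    for wanted in (13, 18, 20):
        print(wanted, plan_build(sparse, wanted))
```

Populating more patch levels costs disk space but shortens the
longest run of changesets that must be applied; a fully-populated
library is the limiting case in which `steps' is always zero.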

-t




