gnu-arch-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gnu-arch-users] a brief note on "delta compression"


From: Tom Lord
Subject: Re: [Gnu-arch-users] a brief note on "delta compression"
Date: Mon, 3 May 2004 16:28:19 -0700 (PDT)

    > From: address@hidden (James Blackwell)

    > You're the boss.

    > After Talli asked about the deltra compression (from the combining
    > multiple deltas standpoint), I sat back and thought about it for awhile. 

    > First, a definition, X is the distance between ancenstor revision M and
    > its succesor N (inclusive). I.E. The X between patch-45 and patch-47 is 2.

    > Right now, we store a "compressed delta" (CD)  of 1. This is obligatory,
    > as one can always be only one revision away from current, and just needs
    > an X of one to catch up. 

Just as an aside: we also like to keep CD(1) just because that's what
the programmer nominally approved -- the exact `tla changes' output
that he agreed to commit.   CD(1) captures not just some random
textual differences but the very textual differences that the informed
programmer deliberately published.

    > If we store a CD (in addition of to the CD with an X of 1) with an X of
    > two, we buy an apparent almost (but not quite!) doubling of speed. We
    > also double the size of our archives.

    > If we look at the opposite extreeme, say with a CD that has an X of 50,
    > it would seem that we would have a fifty fold increase of speed, but
    > this is *not true*! The reason is that this CD would only do good if
    > your local working copy was more than 50 revisions away from "current".

    > The lower we set X, the less point there is in bothering to do CD in the
    > first place. The higher we set X, the more useful the extra stored CD,
    > but the less likely that we'll actually be able to use it.

    > So we have to pick an X that is an estimate of how many revisions people
    > are missing at any given time. Too high, and these extra stored
    > revisions go unused, because nobody is that far away from the most
    > current revision to use the CD. Too low, and we're not gaining the
    > full benefit of storing extra CDs. 

    > My personal hunch is that the proper X is right around 5. I've got no
    > actual proof for that, mind you. 

Other ideas are to store a summary delta at every point where the sum
of the sizes of summarized deltas would exceed the size of the summary
delta.  In some sense, that's optimal.  We've tossed around the idea
of adding some size data to revision directories to facilitate this.

Another idea is to store overlapping summary deltas at different
scales.  Svn's skip-list-style "skip deltas" do this.  (I.e., store
summary deltas for patch-1..patch-4, patch-4..patch-7,
patch-7..patch-10, patch-10..patch-13, but also for patch-1..patch-7
and patch-7..patch-13.  So, if you want patch-11 and you have base-0,
you can apply the summary patch-1..patch-7, then patch-7..patch-10,
then patch-11.)

Yet another idea, for smart servers, is to cache custom-made summary
deltas.   The protocol design discussion we had a while back touched
on this.   The idea is:  suppose that 80% of the readers of the tla
archive last updated at patch-K.   Recently I announced a new release
at patch-N.    Many people need summary patch-K...patch-N.   If a
smart-server computes that and caches it on demand.....



    > Just for the record, I personally don't like the idea for the following
    > reasons: 

    >   1. In the case of corruption, who do you believe? Do you believe the CD
    >   with an X of 1, or do you believe the CD with an X of 5? 

It does complicate archive validation but you believe the CD(1) changesets.


    >   2. Empirically, I've seen *way* too many people dick (sorry for the
    >   language, but that's the appropriate word here) with their archives.

I agree.


    >   3. This undercuts cached revisions and revision libraries

And vice versa.   Nobody's done it yet because most of his solve the
problem using cached revisions and/or rev libs.

    >   4. Applying revisions is already pretty darn fast.

    >   5. Theres more important stuff out to do.

Exactly.

-t




reply via email to

[Prev in Thread] Current Thread [Next in Thread]