gnu-arch-users
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Gnu-arch-users] a brief note on "delta compression"


From: Tom Lord
Subject: [Gnu-arch-users] a brief note on "delta compression"
Date: Mon, 3 May 2004 11:47:39 -0700 (PDT)


Someone called my attention to and asked me to comment on an IRC
(#arch) discussion on the topic of "delta compression".   The
discussion is at:

    http://www.thecodemill.biz/services//arch/irc/2004-04-30

and begins at:

    14:38:17


The basic question was:

    talli> "what is the arch stance on delta compression?"

Terminology in the revision control world is not quite as consistent
and universal as one might help.   The term delta compression is used
in at least two ways.



* "delta compression" -- meaning "store revisions as deltas"

One meaning of "delta compression", informally stated, is that
"revisions are stored as changesets (or diffs) relative to other
revisions".  This is, of course, exactly what arch does when you use
`commit':  you create "patch-N+1" which is represented in an archive
as a changeset describing how that revision differs from "patch-N".

That's "compression" because that's a compact way to store a revision.
It's "delta compression" because the compression technique is based on
"deltas" (roughly aka "changesets" or "patches").

Revision control systems that don't use delta compression in that
sense are rare, but not unheard of.  As an anecdotal example, I've
heard of at least one shop that used a whole-file-with-sharing storage
format comparable to arch's revision libraries as it's primary storage
management strategy.  As disk sizes increase, such strategies are
increasingly tempting --- however, i/o bandwidth considerations make
them non-trivial to do both well and scalably.




* "delta compression" -- meaning "store pre-combined deltas"

The second meaning of "delta compression" is vaguer, but generally
refers to replacing a sequence of deltas (aka changesets), which are
meant to be applied sequentially, with a single changeset that
represents their composition.

For example, considering just one file, I might have:

        delta 1:

        replace line 2 with "hello"
        replace line 3 with "world"


        delta 2:

        replace line 2 with "hello,"
        replace line 4 with "   -- K&R"


and a "compressed delta" of these is:

        delta 1+2

        replace line 2 with "hello,"
        replace line 3 with "world"
        replace line 4 with "   -- K&R"



As you can see, delta "1+2" is only three lines while the sum of the
lengths of deltas "1" and "2" is four lines.   So we have combined two
deltas, compressing them in the process.

The questions then arise: when, where, and why would a revision
control system use "compressed deltas" (the second meaning of "delta
compression").

One red-herring idea is to use compressed deltas to save space in
archives.  After checking in the changes for delta 2, for example, I
might discard delta 1 from an archive and store only delta 1+2.  This
idea is a red herring because: (a) disk space is not even remotely
close to being that precious;  (b) the cost of computing the
compressed delta would therefore be a waste of time; and worst of all,
(c), such an approach discards valuable history -- the details of
delta 1 have been lost.

A more serious idea is to use compressed deltas to save cpu time
and/or i/o bandwidth, at the _cost_ of disk space.  For example,
consider what it takes, naively, to build revision 2 in our example.
We start with a base revision, then apply delta 1, then apply delta
2.   That implies the cpu and i/o bandwidth costs of fetching and then
applying both deltas.

We can save some costs by storing in the archive not only delta 1 and
delta 2, but _also_ delta 1+2.    Now if I want to build revision 2, I
need only fetch and apply a single delta, namely delta 1+2.  

That's a fine idea that unambiguously applies in that case that the
only way your revision control system ever builds a revision is by
reading deltas from an archive and applying them.  For example: That's
(very nearly) the only way Subversion ever builds a revision (there's
a single, special-case exception in the "text base" mechanism -- of
which arch's in-tree pristines are a generalization).   Consequently,
Subversion makes agressive and tricky use of compressed deltas (ask
your favorite Subversion developer to explain "skip-deltas" some
time).

Sometimes (and exceptionally) people configure and use arch in a
manner such that the only way revisions are built is by reading deltas
from an archive and applying them.   And, indeed, for that reason,
there has been occaisional requests for a skip-delta-style facility
for arch.   One has been designed.   It's fairly easy to implement (it
would take a skilled arch programmer perhaps 2-3 weeks).   It would
work out very comparably to SVN's skip-deltas.

Interestingly enough, though -- nobody has bothered yet.  Why not?  I
As far as I can tell, it's because 90+% of the time arch's facilities
for archive-cached revisions and revision libraries solve the same
problem, better, and more simply, albeit at the cost of more disk
space.  While arch's alternatives use a little more disk space, in
dollar terms, the excess is trivial.  Simpler is better, in this case.

Don't get me wrong -- I think it's _fairly_ likely we'll add
compressed deltas in archives at some point.  If nothing else, it'll
be fun.   But basically there's no pressing need for it.

                   --------------------------------

Some of the IRC discussion confused me a little bit and made me wonder
if there isn't a third or more defintion of "delta compression"
floating around that I'm not aware of.    If so, or if questions
remain about the topic, please let me know.

-t





reply via email to

[Prev in Thread] Current Thread [Next in Thread]