gnu-arch-users

Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files


From: Tom Lord
Subject: Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files
Date: Thu, 26 Aug 2004 13:44:08 -0700 (PDT)

    > From: Aaron Bentley <address@hidden>

    > Tom Lord wrote:
    > > One thing that we've talked about is the idea of a "delta compressed
    > > revision library" --- i.e., a "revision library" in the sense that it
    > > contains trees extracted from archives and provides a client-side
    > > cache;  "delta compressed" in the sense that it's optimized for access
    > > to individual file history.

    > We kinda have that already, don't we?  Library revisions contain their 
    > changeset in the ,,patch-set directory, and those are a kind of 
    > delta-compressed file history.

We lack tools for building or browsing revisions based only on the
changesets in revlibs.   We lack confirmation that that is a
cost-effective approach (given the amount of file i/o involved,
especially compared to the kinds of tuning that go into SCCS-family
formats, I think we'd have big performance problems).  We lack tools
for creating revlibs that contain changesets but not trees (and is
that certainly what one would want, anyway?).

I'm once again back to the idea that we need a general purpose
persistent cache for anything we might compute from archive or revlib
data.  Just write a trivial algorithm using primitives like
"read-mod-files-index-from-revision", let that work by brute force
behind the scenes, but keep a big cache of all such results.

*That* might have a good chance of giving us an easy to write,
well-performing version of "annotate" and many other commands as well
(at the cost of the trickier to write cache).
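
To make the idea concrete, here is a minimal sketch in C of a trivial
algorithm sitting on top of a brute-force primitive, with a cache of
results in between.  Everything in it is an assumption made for
illustration: the function names are hypothetical (nothing like them
exists in tla), the slow primitive is a stand-in that only fabricates
a string, and the cache is an in-memory table rather than the
persistent store described above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: none of these names exist in tla, and the
   "cache" here lives in memory rather than on disk. */

#define CACHE_SLOTS 64

struct cache_entry
{
  char * key;      /* revision name */
  char * value;    /* result of the brute-force computation */
};

static struct cache_entry cache[CACHE_SLOTS];
static int cache_used = 0;
static int brute_force_calls = 0;  /* how often the slow path ran */

/* Stand-in for a slow primitive such as
   "read-mod-files-index-from-revision".  A real version would read
   archive or revlib data; this one only fabricates a string. */
static char *
brute_force_mod_files_index (const char * revision)
{
  size_t len = strlen (revision) + sizeof ("index-of:");
  char * result = malloc (len);

  ++brute_force_calls;
  if (result)
    snprintf (result, len, "index-of:%s", revision);
  return result;
}

/* Return the index for REVISION, running the slow primitive only on a
   cache miss.  The cache owns the returned string.  Returns NULL if
   memory runs out or the table is full (a real cache would evict). */
static const char *
cached_mod_files_index (const char * revision)
{
  int i;
  size_t key_len;
  char * key;
  char * value;

  assert (revision != NULL);
  for (i = 0; i < cache_used; ++i)
    if (strcmp (cache[i].key, revision) == 0)
      return cache[i].value;

  if (cache_used == CACHE_SLOTS)
    return NULL;

  key_len = strlen (revision) + 1;
  key = malloc (key_len);
  value = brute_force_mod_files_index (revision);
  if (!key || !value)
    {
      free (key);
      free (value);
      return NULL;
    }
  memcpy (key, revision, key_len);
  cache[cache_used].key = key;
  cache[cache_used].value = value;
  ++cache_used;
  return value;
}
```

A caller such as "annotate" would only ever call
cached_mod_files_index; asking twice for the same revision runs the
slow primitive once.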

    >> [earlier description of caching.]

    > Hmm.  I'm not sure what I think about that.

Mmhh.

    > My focus has been on avoiding multiple downloads of the same data, and 
    > I've been thinking very specifically about archive caching.  I think 
    > there are two levels of cacheable data:

    > 1. transient data:
    > As it stands, tla pretends that nothing happens while it's accessing an 
    > archive.  It's a useful fiction, but it's not accurate.  Files may be 
    > added or removed while the process is running.  The worst example might 
    > be attempting to build a revision from a cacherev that was removed after 
    > arch_revision_type was called.  This sort of thing might be possible at 
    > a pfs level, and certainly at higher levels of abstraction.

    > This is the sort of data that goes stale, so it can't be retained 
    > indefinitely.  It includes:
    > - presence or absence of cacherevs
    > - latest revision
    > - The absence of categories, branches, versions, revisions
    > - archive meta-info


    > 2. permanent data:
    > Many aspects of archives are write-once.
    > - The type of a revision
    > - The changes introduced by a simple or continuation revision
    > - The contents of an import revision or cacherev
    > - The presence of categories, branches, versions, revisions
    > - The patchlog of a revision

I think that that is a false distinction and here is why:

It is clearer to view what you are calling "transient information" as
permanent data for which only imperfect access methods exist.  For
example, the cacherev tar bundle (or, really, the set of equivalent
tar bundles) is well defined for every single revision.  In principle,
it is *always* available (so long as sufficient history is
accessible).  

We have just one way to get a cacherev tar bundle and that's to fetch
it from a dumb-fs archive.  It's an unreliable and unstable access
procedure, I agree, and that's why the revision-building algorithms
use it in the heuristic manner they do.

When I say that the equivalence set of cacherev tar bundles is well
defined, I mean that we could, in principle, give it a global arch
name... something like:

  address@hidden/foo--bar--1.0--patch-53/full-tree.tar.gz

and anyone who has access to the raw commit data can compute a value 
which is equal to the one named there.


    > Transient data roughly correlates to "what build-revision stores in 
    > ram", and permanent data is roughly "what might be found in an archive 
    > mirror".

I understand, but examples like the need for a good "annotate"
implementation suggest that build-revision's "in ram" needs are 
not unique.   We already know that computing those "in ram" values is
expensive in some cases that can be sped up considerably with
persistent local caching.    Why not, then, make a very simple general
purpose caching mechanism?


    > In this light, it's unfortunate that arch_revision_type combines 
    > permanent data (the revision type) with transient data (whether it's 
    > cached).

It produces only a hint about whether a get_cacherev call will likely
succeed for that revision.  Sure -- that hint can't be cached, but it
can be propagated up as a property of the cache (i.e., the
cache-access routines can offer similar hints about the expense of
producing certain values).


    > As it stands, there would be a clear advantage to listing continuations 
    > separately.  In some cases, we call arch_revision_type hundreds of times 
    > in order to determine that a namespace-related revision is an actual 
    > descendant or ancestor.  Similarly, we don't usually want to know which 
    > revisions *aren't* imports or cacherevs, we want to know which ones *are*.

Instead of augmenting the archive format or building hairy new
attachments to it, problems such as that can be broken down into
hopefully re-usable subproblems, with the answers to problems and
subproblems being managed by the cache.  For example:

You cite examples of having to examine many revisions to trace
ancestry.

Ancestry can be traced by examining log messages: for one thing we
could be caching those or their headers.

Another idea might be to cache 10-revision segments of ancestry
traces, indexing them by the 10th in the list.   After at most 10
initial probes we can find those in the cache and save a lot of i/o
time reading log files.
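
A rough sketch of the segment idea in C follows.  It rests on several
assumptions that are not in the proposal itself: revisions are
modelled as integers in a toy linear history (patch-N's parent is
patch-(N-1)), every name is hypothetical, and each segment is keyed by
its newest revision, which is only one reading of "the 10th in the
list".  The log_reads counter stands in for the i/o cost of reading
one log file.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: a toy linear history in which patch-N's parent
   is patch-(N-1) and patch-0 is the import.  None of these names
   exist in tla. */

#define SEGMENT_LEN 10
#define MAX_SEGMENTS 128
#define NO_REVISION (-1)

struct ancestry_segment
{
  int head;                /* key: the newest revision in the segment */
  int revs[SEGMENT_LEN];   /* head followed by its next 9 ancestors */
  int next;                /* parent of revs[9], or NO_REVISION */
};

static struct ancestry_segment segments[MAX_SEGMENTS];
static int n_segments = 0;
static int log_reads = 0;  /* stands in for the cost of reading a log */

/* Stand-in for finding a revision's parent by reading its log. */
static int
read_parent_from_log (int rev)
{
  ++log_reads;
  return (rev > 0) ? rev - 1 : NO_REVISION;
}

static const struct ancestry_segment *
find_segment (int head)
{
  int i;

  for (i = 0; i < n_segments; ++i)
    if (segments[i].head == head)
      return &segments[i];
  return 0;
}

/* Write REV and its ancestors, newest first, into OUT (at most MAX
   entries) and return the number written.  Each uncached run of
   SEGMENT_LEN revisions is stored as a new segment on the way. */
static int
trace_ancestry (int rev, int * out, int max)
{
  int n = 0;
  int cur = rev;
  int run_start = 0;  /* index in OUT where the uncached run began */

  assert (out != 0 && max >= 0);
  while (cur != NO_REVISION && n < max)
    {
      const struct ancestry_segment * seg = find_segment (cur);

      if (seg && n + SEGMENT_LEN <= max)
        {
          /* Cache hit: ten revisions for no log reads at all. */
          memcpy (out + n, seg->revs, sizeof seg->revs);
          n += SEGMENT_LEN;
          cur = seg->next;
          run_start = n;
          continue;
        }

      out[n++] = cur;
      cur = read_parent_from_log (cur);
      if (n - run_start == SEGMENT_LEN && n_segments < MAX_SEGMENTS)
        {
          struct ancestry_segment * s = &segments[n_segments++];

          s->head = out[run_start];
          memcpy (s->revs, out + run_start, sizeof s->revs);
          s->next = cur;
          run_start = n;
        }
    }
  return n;
}
```

In this toy model, a trace from patch-38 made after an earlier trace
from patch-35 probes three uncached revisions, then picks up whole
segments from the cache until it reaches the short uncached tail near
the import.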

A cache could be quite simple: just give all of these constant-data
results (like an ancestry trace or a cacherev tar bundle) a name in
the (thus extended) global arch namespace and let people attach rules
for computing (or trying to compute) desired values from just their
name.

  address@hidden/foo--bar--1.0--patch-53/full-tree.tar.gz

  implemented by "arch_compute_full_tree_tar_bundle (char * revision_uri)"

client code would use that as something like:

        if (arch_maybe_get ("address@hidden/full-tree.tar.gz"))
          we have the full-tree tar bundle
        else
          we don't
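
Spelled out a little further, the name-plus-rule arrangement might
look like the C sketch below.  The signatures of arch_maybe_get and
arch_put are guesses, the suffix-matching rule table is just one
possible way to attach rules to names, "example-archive" is a
placeholder archive name, and the rule itself is a stand-in that
returns a fixed string instead of building a real tar bundle.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a get/put interface; the signatures are
   guesses and the storage is an in-memory table. */

typedef char * (*arch_rule_fn) (const char * name);

struct arch_rule { const char * suffix; arch_rule_fn compute; };
struct arch_entry { char * name; char * value; };

#define MAX_ENTRIES 64
static struct arch_entry entries[MAX_ENTRIES];
static int n_entries = 0;

static char *
copy_string (const char * s)
{
  size_t len = strlen (s) + 1;
  char * copy = malloc (len);

  if (copy)
    memcpy (copy, s, len);
  return copy;
}

/* Store a copy of VALUE under NAME.  Returns 0 on success, -1 if the
   table is full or memory runs out. */
static int
arch_put (const char * name, const char * value)
{
  char * n;
  char * v;

  if (n_entries == MAX_ENTRIES)
    return -1;
  n = copy_string (name);
  v = copy_string (value);
  if (!n || !v)
    {
      free (n);
      free (v);
      return -1;
    }
  entries[n_entries].name = n;
  entries[n_entries].value = v;
  ++n_entries;
  return 0;
}

/* Stand-in rule: pretends to build a full-tree tar bundle and fails
   for names containing "unreachable", as a real rule might when the
   needed history is not accessible. */
static char *
compute_full_tree_tar_bundle (const char * name)
{
  if (strstr (name, "unreachable"))
    return NULL;
  return copy_string ("<tar bundle bytes>");
}

static const struct arch_rule rules[] = {
  { "/full-tree.tar.gz", compute_full_tree_tar_bundle },
};

static int
has_suffix (const char * name, const char * suffix)
{
  size_t nl = strlen (name);
  size_t sl = strlen (suffix);

  return nl >= sl && strcmp (name + nl - sl, suffix) == 0;
}

/* Return the value named NAME, or NULL if it is neither cached nor
   computable by any rule.  The cache owns the returned string. */
static const char *
arch_maybe_get (const char * name)
{
  size_t r;
  int i;

  for (i = 0; i < n_entries; ++i)
    if (strcmp (entries[i].name, name) == 0)
      return entries[i].value;

  for (r = 0; r < sizeof rules / sizeof rules[0]; ++r)
    if (has_suffix (name, rules[r].suffix))
      {
        char * value = rules[r].compute (name);
        int rc;

        if (!value)
          return NULL;
        rc = arch_put (name, value);
        free (value);
        return rc ? NULL : entries[n_entries - 1].value;
      }
  return NULL;
}
```

The point of returning NULL rather than an error is that the caller
treats a miss as information ("we don't have it") and falls back to
some other strategy.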


    > If revision types can be stored permanently, that isn't too painful, but 
    > since cacherevs are transient, they really drag us down.  An index (e.g. 
    > directory listing) that listed all cacherevs would allow us to scan the 
    > archive with one server round-trip.

Hence the reason to want a simpler, less presumptive interface like
`arch_maybe_get', `arch_get', `arch_put', etc.

Then pick the names carefully so that you can ask for the various
parts of a revision separately.   Implementations of the cache-filling
rules for these parts can internally use the current pfs layer.

    > So my perspective is that caching at the archive level is a pretty good 
    > match for our purposes.  

These aren't necessarily incompatible.   We can extend a general
purpose cache archive-side.    I suppose the most important thing I'm
recommending is the get/put style interface using an extended arch
namespace. 


-t




