
Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files


From: Aaron Bentley
Subject: Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files
Date: Wed, 25 Aug 2004 19:22:33 -0400
User-agent: Mozilla Thunderbird 0.5 (X11/20040309)

Tom Lord wrote:
> One thing that we've talked about is the idea of a "delta compressed
> revision library" --- i.e., a "revision library" in the sense that it
> contains trees extracted from archives and provides a client-side
> cache;  "delta compressed" in the sense that it's optimized for access
> to individual file history.

We kinda have that already, don't we? Library revisions contain their changeset in the ,,patch-set directory, and those are a kind of delta-compressed file history.


> It isn't obvious to me, though, that this or any other new kind of
> cache needs to be separately and specially coded.
>
> There's only a small number of basis functions for "things we want to
> be able to compute from archives".  E.g., there's
> `patch_log_of(revision)' and `changeset_of(revision)' and
> `changed_file_ids(revision)'.... stuff like that.   And it's not that
> long a list.   Add in "second order" computations, like
> `which_patches_change_file(id)' and the list *still* isn't that long.
>
> All of those basis functions have fully externalizable (i.e., printable,
> readable) parameters and return values.
>
> All of those basis functions are constant functions: given equal
> arguments, they always produce equal results.
>
> Some of those functions are expensive;  the second-order functions
> depend on the first-order functions, including some expensive ones.
>
> One idea here is to add a very general caching mechanism built around
> those functions rather than around a particular data structure like a
> revlib.  Schematically, instead of calling:
>
>         log = patch_log_of (revision);
>         touched_names = touched_file_names (log);
>
> one would write:
>
>         touched_names = query ("touched_names $revision");
>
> if that result is cached, fine, otherwise, the cache is automatically
> populated with:
>
>         cached_result["touched_names $revision"]
>          = touched_file_names (query ("patch_log_of $revision"));
>
> and so forth.
>
> Designed with care, a caching implementation of "query" might be able
> to behave with all the important performance characteristics of the
> less-generic caches we might want to add -- and with far less code!

Hmm.  I'm not sure what I think about that.

My focus has been on avoiding multiple downloads of the same data, and I've been thinking very specifically about archive caching. I think there are two levels of cacheable data:

1. transient data:
As it stands, tla pretends that nothing happens while it's accessing an archive. It's a useful fiction, but it's not accurate. Files may be added or removed while the process is running. The worst example might be attempting to build a revision from a cacherev that was removed after arch_revision_type was called. This sort of thing might be possible at a pfs level, and certainly at higher levels of abstraction.

This is the sort of data that goes stale, so it can't be retained indefinitely. It includes:
- presence or absence of cacherevs
- the latest revision
- the absence of categories, branches, versions, revisions
- archive meta-info

2. permanent data:
Many aspects of archives are write-once.
- The type of a revision
- The changes introduced by a simple or continuation revision
- The contents of an import revision or cacherev
- The presence of categories, branches, versions, revisions
- The patchlog of a revision

Transient data roughly correlates to "what build-revision stores in ram", and permanent data is roughly "what might be found in an archive mirror".

In this light, it's unfortunate that arch_revision_type combines permanent data (the revision type) with transient data (whether it's cached).

> (The idea of a general purpose cache is one of a list of reasons to
> be skeptical about too many changes or enhancements to archive
> formats.   One virtue of the current format is its minimalism: there's
> very little information that's redundantly represented and the
> information that's there is quite compressible.   So one can, for
> example, read and write core arch revisions at just about the maximum
> theoretical speed.   As always, speedups on top of that for particular
> commands should almost always be client-side hacks and almost always
> be based on memoizing or caching the results of constant functions.)

As it stands, there would be a clear advantage to listing continuations separately. In some cases, we call arch_revision_type hundreds of times in order to determine that a namespace-related revision is an actual descendant or ancestor. Similarly, we don't usually want to know which revisions *aren't* imports or cacherevs, we want to know which ones *are*.

If revision types can be stored permanently, that isn't too painful, but since cacherevs are transient, they really drag us down. An index (e.g. directory listing) that listed all cacherevs would allow us to scan the archive with one server round-trip.

So my perspective is that caching at the archive level is a pretty good match for our purposes. I don't think there's a query you can craft that can't be answered in terms of the archive interface, and I bet the benefit of caching 2nd-order queries is pretty minimal, compared with caching 1st-order queries.

Using the archive interface would mean we wouldn't need to rewrite our code to support caches. Archive caches would just be another form of archive. It would also mean we'd be able to use existing archive-pfs code to implement them. And transient caches could be implemented using the pfs code. This would allow write operations to affect read operations.

You could layer it like this:

arch_memoizer_archive <-> arch_transient_archive <-> arch_archive

Layering permanent storage on top of transient storage (e.g. an arch_memoizer_archive using an arch_transient_archive using an arch_archive) would be a great way to separate those concerns.
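As a rough illustration of that layering (the class names follow the diagram above, but the interface is a toy stand-in, not tla's actual arch_archive vtable):

```python
# Sketch of arch_memoizer_archive <-> arch_transient_archive <-> arch_archive:
# a permanent memoizer wrapping a transient cache wrapping the real archive.

class Archive:
    """The real archive: every call is a server round-trip."""
    def __init__(self):
        self.round_trips = 0
    def revision_type(self, revision):
        self.round_trips += 1
        return "simple"
    def has_cacherev(self, revision):
        self.round_trips += 1
        return False

class TransientArchive:
    """Caches volatile facts (e.g. cacherev presence) for one run."""
    def __init__(self, backend):
        self.backend = backend
        self._cacherevs = {}
    def revision_type(self, revision):
        return self.backend.revision_type(revision)   # pass through
    def has_cacherev(self, revision):
        if revision not in self._cacherevs:
            self._cacherevs[revision] = self.backend.has_cacherev(revision)
        return self._cacherevs[revision]

class MemoizerArchive:
    """Stores write-once facts (e.g. revision type) permanently."""
    def __init__(self, backend):
        self.backend = backend
        self._types = {}
    def revision_type(self, revision):
        if revision not in self._types:
            self._types[revision] = self.backend.revision_type(revision)
        return self._types[revision]
    def has_cacherev(self, revision):
        return self.backend.has_cacherev(revision)    # never memoized

real = Archive()
archive = MemoizerArchive(TransientArchive(real))
```

Each layer implements the same interface, so a cache really is "just another form of archive", and each kind of fact is cached at exactly one layer with the right lifetime.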

Aaron
--
Aaron Bentley
Director of Technology
Panometrics, Inc.



