
Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files


From: Aaron Bentley
Subject: Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files
Date: Wed, 25 Aug 2004 19:22:33 -0400
User-agent: Mozilla Thunderbird 0.5 (X11/20040309)

Tom Lord wrote:
> One thing that we've talked about is the idea of a "delta compressed
> revision library" --- i.e., a "revision library" in the sense that it
> contains trees extracted from archives and provides a client-side
> cache;  "delta compressed" in the sense that it's optimized for access
> to individual file history.

We kinda have that already, don't we? Library revisions contain their changeset in the ,,patch-set directory, and those are a kind of delta-compressed file history.


> It isn't obvious to me, though, that this or any other new kind of
> cache needs to be separately and specially coded.
>
> There's only a small number of basis functions for "things we want to
> be able to compute from archives".  E.g., there's
> `patch_log_of(revision)' and `changeset_of(revision)' and
> `changed_file_ids(revision)'.... stuff like that.   And it's not that
> long a list.   Add in "second order" computations, like
> `which_patches_change_file(id)' and the list *still* isn't that long.
>
> All of those basis functions have fully externalizable (i.e., printable,
> readable) parameters and return values.
>
> All of those basis functions are constant functions: given equal
> arguments, they always produce equal results.
>
> Some of those functions are expensive;  the second-order functions
> depend on the first-order functions, including some expensive ones.
>
> One idea here is to add a very general caching mechanism built around
> those functions rather than around a particular data structure like a
> revlib.  Schematically, instead of calling:
>
>         log = patch_log_of (revision);
>         touched_names = touched_file_names (log);
>
> one would write:
>
>         touched_names = query ("touched_names $revision");
>
> if that result is cached, fine, otherwise, the cache is automatically
> populated with:
>
>         cached_result["touched_names $revision"]
>          = touched_file_names (query ("patch_log_of $revision"));
>
> and so forth.
>
> Designed with care, a caching implementation of "query" might be able
> to behave with all the important performance characteristics of the
> less-generic caches we might want to add -- and with far less code!

Hmm.  I'm not sure what I think about that.

My focus has been on avoiding multiple downloads of the same data, and I've been thinking very specifically about archive caching. I think there are two levels of cacheable data:

1. transient data:
As it stands, tla pretends that nothing happens while it's accessing an archive. It's a useful fiction, but it's not accurate. Files may be added or removed while the process is running. The worst example might be attempting to build a revision from a cacherev that was removed after arch_revision_type was called. This sort of thing might be possible at a pfs level, and certainly at higher levels of abstraction.

This is the sort of data that goes stale, so it can't be retained indefinitely. It includes:
- presence or absence of cacherevs
- the latest revision
- the absence of categories, branches, versions, revisions
- archive meta-info

2. permanent data:
Many aspects of archives are write-once.
- The type of a revision
- The changes introduced by a simple or continuation revision
- The contents of an import revision or cacherev
- The presence of categories, branches, versions, revisions
- The patchlog of a revision

Transient data roughly correlates to "what build-revision stores in ram", and permanent data is roughly "what might be found in an archive mirror".

In this light, it's unfortunate that arch_revision_type combines permanent data (the revision type) with transient data (whether it's cached).

> (The idea of a general purpose cache is one of a list of reasons to
> be skeptical about too many changes or enhancements to archive
> formats.   One virtue of the current format is its minimalism: there's
> very little information that's redundantly represented and the
> information that's there is quite compressible.   So one can, for
> example, read and write core arch revisions at just about the maximum
> theoretical speed.   As always, speedups on top of that for particular
> commands should almost always be client-side hacks and almost always
> be based on memoizing or caching the results of constant functions.)

As it stands, there would be a clear advantage to listing continuations separately. In some cases, we call arch_revision_type hundreds of times in order to determine that a namespace-related revision is an actual descendant or ancestor. Similarly, we don't usually want to know which revisions *aren't* imports or cacherevs, we want to know which ones *are*.

If revision types can be stored permanently, that isn't too painful, but since cacherevs are transient, they really drag us down. An index (e.g. directory listing) that listed all cacherevs would allow us to scan the archive with one server round-trip.

So my perspective is that caching at the archive level is a pretty good match for our purposes. I don't think there's a query you can craft that can't be answered in terms of the archive interface, and I bet the benefit of caching 2nd-order queries is pretty minimal, compared with caching 1st-order queries.

Using the archive interface would mean we wouldn't need to rewrite our code to support caches. Archive caches would just be another form of archive. It would also mean we'd be able to use existing archive-pfs code to implement them. And transient caches could be implemented using the pfs code. This would allow write operations to affect read operations.

You could layer it like this:

arch_memoizer_archive <-> arch_transient_archive <-> arch_archive

Layering permanent storage on top of transient storage (e.g. an arch_memoizer_archive using an arch_transient_archive using an arch_archive) would be a great way to separate those concerns.
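As a rough illustration of that layering (the class names follow the diagram above, but the interface is a toy stand-in, not tla's actual arch_archive vtable):

```python
# Sketch of arch_memoizer_archive <-> arch_transient_archive <-> arch_archive:
# a permanent memoizer wrapping a transient cache wrapping the real archive.

class Archive:
    """The real archive: every call is a server round-trip."""
    def __init__(self):
        self.round_trips = 0
    def revision_type(self, revision):
        self.round_trips += 1
        return "simple"
    def has_cacherev(self, revision):
        self.round_trips += 1
        return False

class TransientArchive:
    """Caches volatile facts (e.g. cacherev presence) for one run."""
    def __init__(self, backend):
        self.backend = backend
        self._cacherevs = {}
    def revision_type(self, revision):
        return self.backend.revision_type(revision)   # pass through
    def has_cacherev(self, revision):
        if revision not in self._cacherevs:
            self._cacherevs[revision] = self.backend.has_cacherev(revision)
        return self._cacherevs[revision]

class MemoizerArchive:
    """Stores write-once facts (e.g. revision type) permanently."""
    def __init__(self, backend):
        self.backend = backend
        self._types = {}
    def revision_type(self, revision):
        if revision not in self._types:
            self._types[revision] = self.backend.revision_type(revision)
        return self._types[revision]
    def has_cacherev(self, revision):
        return self.backend.has_cacherev(revision)    # never memoized

real = Archive()
archive = MemoizerArchive(TransientArchive(real))
```

Each layer implements the same interface, so a cache really is "just another form of archive", and each kind of fact is cached at exactly one layer with the right lifetime.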

Aaron
--
Aaron Bentley
Director of Technology
Panometrics, Inc.



