[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files
From: |
Aaron Bentley |
Subject: |
Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files |
Date: |
Wed, 25 Aug 2004 19:22:33 -0400 |
User-agent: |
Mozilla Thunderbird 0.5 (X11/20040309) |
Tom Lord wrote:
One thing that we've talked about is the idea of a "delta compressed
revision library" --- i.e., a "revision library" in the sense that it
contains trees extracted from archives and provides a client-side
cache; "delta compressed" in the sense that it's optimized for access
to individual file history.
We kinda have that already, don't we? Library revisions contain their
changeset in the ,,patch-set directory, and those are a kind of
delta-compressed file history.
It isn't obvious to me, though, that this or any other new kind of
cache needs to be separately and specially coded.
There's only a small number of basis functions for "things we want to
be able to compute from archives". E.g., there's
`patch_log_of(revision)' and `changeset_of(revision)' and
`changed_file_ids(revision)'.... stuff like that. And it's not that
long a list. Add in "second order" computations, like
`which_patches_change_file(id)' and the list *still* isn't that long.
All of those basis functions have full externalizable (i.e., printable,
readable) parameters and return values.
All of those basis functions are constant functions: given equal
arguments, they always produce equal results.
Some of those functions are expensive; the second-order functions
depend on the first-order functions, including some expensive ones.
One idea here is to add a very general caching mechanism built around
those functions rather than around a particular data structure like a
revlib. Schematically, instead of calling:
log = patch_log_of (revision);
touched_names = touched_file_names (log);
one would write:
touched_named = query ("touched_names $revision");
if that result is cached, fine, otherwise, the cache is automatically
populated with:
cached_result["touched_names $revision"]
= touched_file_names (query ("patch_log_of $revision"));
and so forth.
Designed with care, a caching implementation of "query" might be able
to behave with all the important performance characteristics of the
less-generic caches we might want to add -- and with far less code!
Hmm. I'm not sure what I think about that.
My focus has been on avoiding multiple downloads of the same data, and
I've been thinking very specifically about archive caching. I think
there are two levels of cacheable data:
1. transient data:
As it stands, tla pretends that nothing happens while it's accessing an
archive. It's a useful fiction, but it's not accurate. Files may be
added or removed while the process is running. The worst example might
be attempting to build a revision from a cacherev that was removed after
arch_revision_type was called. This sort of thing might be possible at
a pfs level, and certainly at higher levels of abstraction.
This is the sort of data that goes stale, so it can't be retained
indefinitely. It includes:
- presence or absense of cacherevs
- latest revision
- The absense of categories, branches, versions, revisions
- archive meta-info
2. permanent data:
Many aspects of archives are write-once.
- The type of a revision
- The changes introduced by a simple or continuation revision
- The contents of an import revision or cacherev
- The presence of categories, branches, versions, revisions
- The patchlog of a revision
Transient data roughly correlates to "what build-revision stores in
ram", and permanent data is roughly "what might be found in an archive
mirror".
In this light, it's unfortunate that arch_revision_type combines
permanent data (the revision type) with transient data (whether it's
cached).
(The idea of a general purpose cache is one of a list of reasons to
be skeptical about too many changes or enhancements to archive
formats. One virtue of the current format is its minimalism: there's
very little information that's redundently represented and the
information that's there is quite compressable. So one can, for
example, read and write core arch revisions at just about the maximum
theoretical speed. As always, speedups on top of that for particular
commands should almost always be client-side hacks and almost always
be based on memoizing or caching the results of constant functions.)
As it stands, there would be a clear advantage to listing continuations
separately. In some cases, we call arch_revision_type hundreds of times
in order to determine that a namespace-related revision is an actual
descendant or ancestor. Similarly, we don't usually want to know which
revisions *aren't* imports or cacherevs, we want to know which ones *are*.
If revision types can be stored permanently, that isn't too painful, but
since cacherevs are transient, they really drag us down. An index (e.g.
directory listing) that listed all cacherevs would allow us to scan the
archive with one server round-trip.
So my perpective is that caching at the archive level is a pretty good
match for our purposes. I don't think there's a query you can craft
that can't be answered in terms of the archive interface, and I bet the
benefit of caching 2nd-order queries is pretty minimal, compared with
caching 1st-order queries.
Using the archive interface would mean we wouldn't need to rewrite our
code to support caches. Archive caches would just be another form of
archive. It would also mean we'd be able to use existing archive-pfs
code to implement them. And transient caches could be implemented using
the pfs code. This would allow write operations to affect read operations.
You could layer it like this:
arch_memoizer_archive <-> arch_transient_archive <-> arch_archive
Layering permanent storage on top of transient storage (e.g. an
arch_memoizer_archive using an arch_transient_archive using an
arch_archive) would be a great way to separate those concerns.
Aaron
--
Aaron Bentley
Director of Technology
Panometrics, Inc.
- Re: [Gnu-arch-users] Inconsistency of Added-files and Removed-files, (continued)
- [Gnu-arch-users] [BUG] Patchlog Removed-files mentions explicit id-tags, David Allouche, 2004/08/17
- Re: [Gnu-arch-users] Inconsistency of Added-files and Removed-files, James Blackwell, 2004/08/17
- Re: [Gnu-arch-users] Inconsistency of Added-files and Removed-files, Aaron Bentley, 2004/08/17
- [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Miles Bader, 2004/08/17
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Tom Lord, 2004/08/17
- [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Miles Bader, 2004/08/17
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Tom Lord, 2004/08/18
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Aaron Bentley, 2004/08/18
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Tom Lord, 2004/08/18
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files,
Aaron Bentley <=
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Tom Lord, 2004/08/26
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Tom Lord, 2004/08/26
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Aaron Bentley, 2004/08/26
- [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Stefan Monnier, 2004/08/18
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Aaron Bentley, 2004/08/18
- [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Stefan Monnier, 2004/08/18
- Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Jason McCarty, 2004/08/18
- [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files, Stefan Monnier, 2004/08/18