
Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files


From: Aaron Bentley
Subject: Re: [Gnu-arch-users] Re: Inconsistency of Added-files and Removed-files
Date: Thu, 26 Aug 2004 18:18:08 -0400
User-agent: Mozilla Thunderbird 0.5 (X11/20040309)

Tom Lord wrote:
We lack tools for building or browsing revisions based only on the
changesets in revlibs.

Yes. I've considered treating the library as a read-only archive, but I doubt it's a win. Apparently, the revision browsers do use revlib changesets.

I'm once again back to the idea that we need a general purpose
persistent cache for anything we might compute from archive or revlib
data.  Just write a trivial algorithm using primitives like
"read-mod-files-index-from-revision", let that work by brute force
behind the scenes, but keep a big cache of all such results.

I guess the problem I have is seeing how that fits into tla as it stands. You wouldn't put that query in apply-changeset, would you? It's more efficient to get mod-files-index and orig-files-index (etc.) at the same time.

It is clearer to view what you are calling "transient information" as
permanent data for which only imperfect access methods exist.  For
example, the cacherev tar bundle (or, really, the set of equivalent
tar bundles) is well defined for every single revision.  In principle,
it is *always* available (so long as sufficient history is
accessible).

I'm not sure I agree. Given sufficient history, we have the ability to produce a given source tree. And a cacherev is just a representation of that tree. So I'd be quite comfortable saying "the source tree is well defined for every single revision". But cacherevs are only important as part of the implementation of producing source trees, and that implementation needs true answers about what resources are available, in order to perform well.

When I say that the equivalence set of cacherev tar bundles is well
defined, I mean that we could, in principle, give it a global arch
name... something like:

  address@hidden/foo--bar--1.0--patch-53/full-tree.tar.gz

Yeah, now you're cooking with gas.  The tree, not the cacherev or import.

This is an above-archive level of abstraction; it belongs somewhere like find_or_make_local_copy.

> Transient data roughly correlates to "what build-revision stores in
> ram", and permanent data is roughly "what might be found in an archive
> mirror".

I understand, but examples like the need for a good "annotate"
implementation suggest that build-revision's "in ram" needs are not
unique.  We already know that computing those "in ram" values is
expensive in some cases that can be sped up considerably with
persistent local caching.    Why not, then, make a very simple general
purpose caching mechanism?

Actually, I agree. Using a general backend for caching does make sense, if you're caching more than just archive data. And these mechanisms could even be exposed as Arch commands.

It produces only a hint about whether a get_cacherev call will likely
succeed for that revision.

Err, it's treated as though it's a guarantee. tla isn't robust against cacherevs that go missing before they can be downloaded.

Sure -- that hint can't be cached, but it
can be propagated up as a property of the cache (i.e., the
cache-access routines can offer similar hints about the expense of
producing certain values).

Well, this is what I mean about transient/permanent data. In order to build the tree efficiently, you need to know what full archive revisions are available. Cost estimates are even better -- if they're accurate. In order to know whether you can write to an archive, you need to check whether /=meta-info/mirror exists.

Instead of augmenting the archive format or building hairy new
attachments to it

I don't believe it would be hairy to simply store all available cacherevs for a given version in a common directory.

e.g.
/web/web--release/web--release--7.6.6/cacherevs/web--release--7.6.6--patch-54.tar.gz

instead of

/web/web--release/web--release--7.6.6/patch-54/web--release--7.6.6--patch-54.tar.gz

You'd then list that directory once when building a revision. You'll note a similarity to my delta proposal, of course.

problems such as that can be broken down into
hopefully re-usable subproblems, with the answers to problems and
subproblems being managed by the cache.  For example:

You cite examples of having to examine many revisions to trace
ancestry.

Perhaps I shouldn't have included ancestry tracing. I did note that the advantages dry up a lot when you can cache revision type and continuation data.

Ancestry can be traced by examining log messages: for one thing we
could be caching those or their headers.

Ancestry can't be definitively traced by looking at log messages, because import revisions can occur anywhere, but lack distinctive headers.

However, there's no reason we can't cache revision types.

Another idea might be to cache 10-revision segments of ancestry
traces, indexing them by the 10th in the list.   After at most 10
initial probes we can find those in the cache and save a lot of i/o
time reading log files.

Maybe, but just having the files as local data would be a huge leap forward.

        if (arch_maybe_get ("address@hidden/full-tree.tar.gz"))
          we have the full-tree tar bundle
        else
          we don't


> If revision types can be stored permanently, that isn't too painful, but
> since cacherevs are transient, they really drag us down. An index (e.g.
> directory listing) that listed all cacherevs would allow us to scan the
> archive with one server round-trip.

Hence the reason to want a simpler, less presumptive interface like
`arch_maybe_get' and `arch_get', `arch_put', etc.

Then pick the names carefully so that you can ask for the various
parts of a revision separately.   Implementations of the cache-filling
rules for these parts can internally use the current pfs layer.

If we're trying to build a revision, we want caching too. Re-downloading intermediate changesets sucks. But at the same time, if the data's not in the cache, we don't want to think it is.

> So my perspective is that caching at the archive level is a pretty good
> match for our purposes.

These aren't necessarily incompatible.   We can extend a general
purpose cache archive-side.    I suppose the most important thing I'm
recommending is the get/put style interface using an extended arch
namespace.    Parts of the cache could conceivably reside archive-side
(or arguably already do).

When I say "caching at the archive level", I don't mean caching data *in* remote archives. I mean "retaining and using local copies of data produced by archive access". e.g. by having an "archive" type that wraps another archive, storing results locally and using local data where available.

Aaron
--
Aaron Bentley
Director of Technology
Panometrics, Inc.



