Re: [Monotone-devel] newbie question - SHA1 vs serials


From: K. Richard Pixley
Subject: Re: [Monotone-devel] newbie question - SHA1 vs serials
Date: Thu, 21 Apr 2005 11:37:10 -0700
User-agent: Mozilla Thunderbird 1.0.2 (Macintosh/20050317)

Emile Snyder wrote:
>>> Can you clarify at all what sort of support an SCM could give you to let
>>> you have > 128 concurrent developers on one branch all churning a given
>>> file?
>>
>> I wasn't necessarily thinking of them on the same branch.
>
> Sorry, I thought you were saying that you thought that forcing
> partitioning (what I took to mean working on different branches) was a
> limitation of a tool; that better SCM systems should support you working
> in this mode without forcing such a partitioning.

I think that one measure of "strength" of an SCM system is the number of developers it can support with concurrent access to a single file on a single branch.  By this measure, CVS, with its branch-and-merge approach to working directories, is superior to its predecessor, RCS, which required write locks.  And by this measure, dynamic-view clearcase is superior to CVS because working directory divergence only occurs immediately prior to a working directory write rather than prior to a working directory read.

An additional metric for "strength" is the tool's ability to fan out past this initial level, and the range of fan out available.  Beyond that range, manual techniques must be employed.

I once worked in a place where there was a 2 foot high stuffed animal.  You had to physically go to the cube of whoever had the animal (it was usually pretty easy to locate), ask the current owner for the plushie, and physically carry it back to your cube before you were considered to have authority to commit to the sensitive bits of the source tree.  This is an example of a tool failing to provide support and a manual process taking up the slack.

Some possible tools for supporting fan out past the initial level:

* time-plexing, i.e. "taking turns", of which write locks are one approach
* copying, i.e. motion in the file name space, with history recording the previously common ancestral locations
* branching, i.e. motion in some naming domain other than the file name space, with history recording the common ancestral names
* geographic motion, i.e. copies from one name space to another, typically with geography in between, e.g. gnu emacs => gnu emacs & xemacs

CVS "branching", such as it is, was a step in the right direction but only partially usable.  CVS, (really RCS) branching was heirarchical, though it had some strange limitations and the difference between working with a collection of files and a source tree began to be more obvious with branches.

Subversion doesn't really have a branching model at all, choosing instead to conventionally encourage "branch by copy", which is pretty much available in any post-CVS system already.  Specifically, I can't really ask subversion, "in which branches does this file exist?"  Subversion doesn't have any way to answer this question because it doesn't have any way to identify branches as distinct from simple file duplication.
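
For concreteness, the convention looks roughly like this (the repository URL and branch name here are made up):

svn copy http://svn.example.com/repo/trunk \
        http://svn.example.com/repo/branches/some-branch \
        -m "create a branch as a cheap copy"
# To subversion the result is just another directory; nothing in the
# metadata distinguishes a "branch" from any other copy, which is why the
# "which branches contain this file?" question has no answer.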

SVK extends the subversion model by effectively pairing the file name space with geographic ownership such that some files in the name space are conventionally owned by one site while others are conventionally owned by another.

Clearcase's branching facility is fully functional.  Branch space is orthogonal to file name space.  Branch space is hierarchical on a per-element basis and named based on branch pedigree.  This is all fairly convenient most of the time, since the full branch identifier tells me, visually and immediately, the branch pedigree of an element, though it's occasionally rather awkward, since multiple elements in a source tree don't necessarily branch in unison.

Monotone's ability to "load balance" multiple repositories is another step toward addressing fan out.  Monotone's branching facility appears to be fully functional.  Branch naming is restricted, though only conventionally.  The lack of naming pedigree allows for more flexibility in branching and a somewhat simplified paradigm.  While there is no obvious branch pedigree, all items in a repository effectively branch in unison.  These are all properties in favor of monotone's "strength".

However, monotone does have some drawbacks.

> So, your concerns about monotone scaling to this number of concurrent
> developers are primarily about administrative/key management issues?  Or
> something else?

Administrative/key management issues are big, yes.  These are a clear problem already.  This task grows geometrically with the number of users and also geometrically with the number of repositories, which is typically at least as large as the set of users.  Worse, the repositories are not necessarily even reliably available (they could be on remote laptops).  Contrast this with any of the other systems, which all rely on pre-existing central authentication mechanisms that are already a sunk cost in most institutions and thus add no administrative overhead.
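
To make that concrete, distributing a single developer's public key looks something like this (commands from memory of the current command set; the key id and database names are made up), and it has to happen once per user per database:

monotone --db=jane.db genkey jane@example.com           # each developer generates a keypair
monotone --db=jane.db pubkey jane@example.com > jane.pubkey   # dump the public half as a packet
monotone --db=other.db read < jane.pubkey               # feed it into each other database that needs it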

In CVS, we saw problems with churn.  If enough developers are checking out that common file, changing it, merging updates before committing, and then committing, then at some level of fan out the merging never completes.  Every time a user attempts to merge, there are more changes to be merged, and so his work flow becomes an algorithm that isn't even guaranteed to terminate.  Exclusive write locks address this problem by serializing access to the repository.  Rigid access control addresses this problem by serializing access to the repository and by supporting manual mechanisms like the 2 foot stuffie.
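
The work flow I mean is roughly the following loop; the conflict resolution step is, of course, manual:

until cvs commit -m "my change"; do   # commit fails its up-to-date check if someone else landed first
    cvs update                        # merge in whatever arrived in the meantime
    # ...resolve any conflict markers by hand, rebuild, retest...
done
# With enough churn, every pass through "cvs update" pulls in new changes,
# so nothing guarantees this loop ever exits.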

Monotone has neither exclusive write locks nor rigid access control.  I'm concerned about the case where so many developers are committing that the number of heads consistently rises.  Even if developers are required to merge first, the number of heads continues to rise.
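
In day to day terms, I mean something like this (commands from memory of the current command set; the branch name is made up):

monotone --branch=com.yourcorp.ourproduct heads   # how many heads does the branch have now?
monotone --branch=com.yourcorp.ourproduct merge   # merge them back down toward one
# If commits arrive faster than people run "merge", the list printed by
# "heads" just keeps getting longer.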

Maybe this is a good thing.  Maybe this means that monotone can handle bigger spikes in churn rate than other tools.

OTOH, it leads to another problem.  In any of the previous systems, a continuously high level of churn eventually leads to repository breakdown.  The failure mode is obvious, and so is the solution: incorporate another fan out approach or, when those have been exhausted, fall back on a manual process.

In monotone, the failure mode is more subtle - the number of heads simply rises.  And it's not yet clear to me how a company might transition from this state into any of the others.

It looks to me as though most real repositories will consistently have multiple heads.  At the very least, I'm going to occasionally make changes that I, personally, consider debatable.  Or debug only.  Or not suitable for prime time.  And I'm just going to leave them dangling.  There's no real point in destroying them, even if I could.  And there are strong reasons to keep them.  As the number of these intentionally dangling heads is likely to rise with the number of users, I think the multiple-head complexity issue is likely to explode.  The more complex the merging task, the fewer people will be willing to dive into it, and the less frequently they will be willing to do so.

I don't think this represents a clear problem necessarily, at least not in the same category as the key management and distribution issue.  It's simply a personal concern.  I don't yet see how it will work and that concerns me.

> When you say you want to be able to "state and apply per branch
> authorization mechanisms" what sort of things do you have in mind?
> Stuff like "only accept revisions on branch B if they're signed by a key
> in this <set of branch committer keys>"?

Yes, that's one application.

> Do you mean on pull, so only
> that stuff gets into your local db, or is it sufficient to do this on
> checkout?

Ultimately I want pull too.  If the company has infinitely wide branching, I don't want to be downloading an infinite number of revisions every time I pull to my laptop.
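
That is, I want to be able to say something like the following and fetch only the piece I care about (netsync syntax from memory; the server name is made up):

monotone --db=laptop.db pull monotone.yourcorp.com com.yourcorp.ourproduct
# pull only the com.yourcorp.ourproduct branches into my laptop's database,
# rather than everything everyone has ever pushed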

>>>> * sandbox using pieces from different branches
>>>
>>> hmmm.  at what granularity are the pieces of interest?   [...] care to elaborate?
>> Last week I released OurProduct-0.1 which was built using
>> LocalTool-4.5.  Both are located in the same repository.  Yesterday I
>> released OurProduct-0.2 which was built using LocalTool-4.6 and
>> somehow, a bug got past all of our QA groups and reached the
>> customer.  We suspect that the problem was introduced by LocalTool-4.6,
>> so I'd like to try building OurProduct-0.2 with LocalTool-4.5 in order
>> to compare its behavior to yesterday's release.
>>
>> Assume that LocalTool has so many connections to OurProduct that I
>> can't just point a Makefile variable at another sandbox.

> So you have a repository with two projects (un-related branches),

One product.  LocalTool is a bit of code that we compile natively and then use in order to build OurProduct.  It's not useful on its own, nor is it used by any other product.  (wart, from ckermit, is like this.)
> com.yourcorp.ourproduct and com.yourcorp.localtool.  What do you mean
> "too many connections"?  Until I got to that line I was thinking:
>
> monotone --branch=com.yourcorp.localtool checkout t:4.5 LocalTool-4.5
> cd LocalTool-4.5
> make && make install
> monotone --branch=com.yourcorp.ourproduct checkout t:0.2 OurProduct-0.2
> cd OurProduct-0.2
> ./configure --with-localtool-dir=../LocalTool-4.5
> make && make install

One product, one branch.

> Do you mean that you want LocalTool to be a subdir (or subset of files
> in) the OurProduct project,

Yes.

> but to version them differently?

More or less, yes.  Version is just arbitrary marketing hype, right?  It's just a constant with vaguely conventional meaning.

> From an
> organization standpoint, I would still want to have the two projects as
> detailed above, and just checkout my localtool working dir as a
> subdirectory of my ourproduct working dir.  I don't know how monotone is
> with checking out another project into a subdirectory of a monotone
> working dir... seems like something that should work, but I haven't
> played with it.

This would prohibit moving code from one into the other.

>>>> * grep the heads of all branches for given strings
>>>
>>> this sort of stuff would be really nice.  nothing in monotone at the
>>> moment.
>>
>> In clearcase, this would probably be done by grep'ing against version
>> extended pathnames.
>
> Huh.  So does clearcase store the files uncompressed, or index them as
> you check them in, or what?  Curious about strategies to support this
> functionality efficiently...

Clearcase in windows, or at very small sites, is often used as a check-in, check-out repository, much like CVS or anything else.  This approach is called "static views" because your sandbox only changes during the equivalent of "cvs update" (it's a "load" command in clearcase).  This has also now been extended to unix in order to support CVS-style disconnected development.

In clearcase on unix, aka original clearcase, aka "dynamic views", clearcase provides a real network file system interface through the kernel.  When I create a sandbox, I get a replica of the root file system, complete with each repository mounted in its place, and I use a bit of out-of-band data, a "config spec", to describe the versions I wish to view.  Having created a sandbox, I set a view context, which effectively attaches the config spec info to my proc table entry such that child processes see the same view of the file system.  Then, essentially, files are looked up, uncompressed, unencrypted (and cached) at "open" time.  Attempting to write to one of these files results in a "read only filesystem" error, though I can create new files (local to my sandbox) at will.

To make a change, I "check out" the file, which converts the "read-only-filesystem" file into a local copy.  Commit is the reverse action.  Both write-lock and branch & merge paradigms are readily available.  To add, remove, or move files, I check out the directory, "cleartool mkelem" or "cleartool mkdir", then commit the directory.
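
For instance, adding a new file goes roughly like this (the file name is made up; "-nc" just means "no comment"):

cleartool checkout -nc .           # check out the containing directory
cleartool mkelem -nc newfile.c     # turn my view-private file into a versioned element
cleartool checkin -nc newfile.c .  # commit the new element and the directory that lists it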

Dynamic views are dynamic in the sense that I see the latest thing that matches my config spec at each open.  In a typical dynamic view, my builds are consistent, but two different builds may well produce different answers if, say, you have committed a change in the intervening time.  If I really want to freeze myself in time, I use a config spec based on a tag or a calendar date+time.
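
A config spec is just a short rule file.  Something like this (the label is made up) pins a view to a tagged release and falls back to the latest version on the main branch:

element * CHECKEDOUT            # always show whatever I have checked out
element * OURPRODUCT_0.2_REL    # otherwise, the version carrying this label
element * /main/LATEST          # otherwise, the latest version on main

It gets applied to a view with "cleartool setcs".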

Go ahead, ask me about "clearmake".

>> In a central repository system, I have some indication of who's doing
>> work, where, and how, from the patterns of their checkouts, branches
>> they've created (but not yet committed to), etc.  The best I have with
>> monotone is that someone sync'd.  I don't know who or where, and the
>> what is simply the entire repository, which isn't particularly
>> indicative.
>
> Huh.  Yeah, you can't get that info from monotone.  But, is it really
> the useful info anyway?  Even if joe checked out branches foo, bar, and
> baz I don't know if he's working on any of them.  Seems like knowing who
> committed where is at least 90% of the useful info.  (And you can't
> create a branch in monotone without committing to it; branches are the
> collection of revisions with a given branch cert.)

That tells me about the past, not the present or the future.

I routinely find that all the files I'm interested in were last changed by people who are no longer at the company.  If they were still around, then I probably wouldn't be there.  So in many cases, the historical data is of limited use.

>>> Just my opinion (and I'm by no means the expert), but it seems to me
>>> like the one really painful missing piece is the corrected directory
>>> handling.
>>
>> Which piece is this?
>
> monotone doesn't explicitly handle directories at the moment, it just
> makes them if their existence is implied by a file it's managing.
> There are various problems that this causes whose bug numbers I don't
> recall off hand.

Oh, ouch!  Ok, that probably knocks monotone off my current list of possibilities for now.

> Apparently (although it's a bit over my head) the reworking to handle
> them explicitly has some subtle issues.

Yeah.  I've been through these design issues myself many times over the last decade or two.

The best I've come up with is essentially duplicating the inode structure of the unix file system.  Then you need the whole namei duplication as well, which quickly leads you to a versioning file system a la clearcase, or now subversion.

--rich
