monotone-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Monotone-devel] Cherry-Picking, Renames, Etc.


From: Oren Ben-Kiki
Subject: [Monotone-devel] Cherry-Picking, Renames, Etc.
Date: Sun, 28 Nov 2004 23:39:18 +0200
User-agent: KMail/1.7.1

About two years back a friend of mine and myself were bemoaning the 
state of version control systems. This was when Subversion was just 
becoming respectable (having to set up a web server just to do version 
control? you have got to be kidding!) and Arch was going through some 
turmoil. We looked hard at it and reached the conclusion that what CVS 
is to file versioning control systems, Arch is for distributed ones - 
sure, it will _work_, but painfully so. Darcs came much later - it is 
certainly brilliant, but it seems to solve a problems that just don't 
need to be solved (see below).

We spent a fun several days laying down the design for an "ideal" 
system. We came up with something _very_ close to Monotone, with a few 
notable differences, which I'd like to investigate. As a newcomer to 
this list, I'm risking repeating some old thread, so apologies in 
advance if that's the case. This is also going to be long :-)

We had SHA signatures and manifests and revisions, and "certs" and 
digital signatures... the works. We also estimated it will take the two 
of us a few months to get a solid system working, and we'd probably 
spend the rest of our lives improving it :-) As it were, we didn't have 
a few months, we were both very busy at our day job, and I already have 
a pet open-source project that eats my spare programming time... So we 
abandoned the notion. You can imagine, I was overjoyed when I came upon 
Monotone - I'm very happy to see someone else has done the heavy lifting 
for me :-)

At any rate, there are three points where we diverged from Monotone:

1. Cherry-picking:

The essence of the problem as we saw it is this. Given I have revision 
A, and somewhere in the graph, on a different branch, there exist 
revision B which evolved (possibly through several revisions) into 
revision C. Both A and B are derived from some ancient revision D. I 
want to apply the changes from B to C into my revision A to obtain 
revision E, but I don't want the changes between D and B (if I did, 
this would have been a simple merge). Weird, but sometimes useful.

    D --> B --> C
     \      vv
      \-> A ??> E

Now, if one is truly serious about making mix-and-matching "patches" as 
easy as ticking checkboxes in a form, use Darcs. That's the only system 
I've seen that _celebrates_ doing this sort of thing (shudder). My main 
problem with it is that along the way it forgot that patches are merely 
means to an end. I'm uneasy with a version control system that doesn't 
directly address *versions*.

At any rate, we spent some time thinking about it, and I came up with 
the following insight:

    If you have a good 3-way merge, you can "cherry-pick" *anything*.

(A "good" 3-way merge does not complain if the same change in both 
branches. That is, it doesn't complain that "deleting line 7 conflicts 
with deleting line 7" or "renaming foo to bar conflicts with renaming 
foo to bar").

The usual way people handle this problem is to find the common ancestor 
D of revisions B and A. Then, somehow, they detect which "deltas" on 
the path from B to C were applied on the path from D to A, and try to 
apply the rest... or something. It all gets horribly complex very fast.

What you could do, instead, is forget about common ancestors, forget 
about "deltas", and try a simple 3-way merge. Pretend your base 
revision was B. C is obviously derived from it, so that's the revision 
you want to merge into A. Now, *pretend A is derived from B*. If you 
think about it, there _is_ a path leading from B to A: going 
"backwards" all the way to D and then "forward" all the way to A.

      /-> C ---------> E
    B <-- D --> A --/

That's it. A "good" 3-way merge should result in exactly what you need. 
Representing the result in a graph is a bit of a challange, since the 
edge you just added doesn't come from a revision, it comes from a 
*path* between revisions. You'd probably want to add a dotted edge from 
B to C and have an edge from the middle of that dotted edge into 
revision E:

    D --> B --> F --> G --> C
      \    \.............../
       \           \
        \-> A ------> E

Could it be that simple? "Everybody knows" that applying patches 
requires doing messy things with histories and deltas, otherwise 
patches get applied more than once and so on. Well... It _is_ this 
simple. Let's take an example scenario to show why. Suppose I have this 
graph:

    A --> B --> C --> D
      \-> E

I now apply the "patch" B-->C to E, getting F:

    A --> B --> C --> D
     \     \.../
      \       \
       \-> E --> F

So far, so good. Now someone comes and tries to apply the patch B-->D to 
F. Surely, this means we apply the patch B--> C _again_, leading to 
havoc. Right? Wrong. What will happen is:

      /-> C --> D -------\
    B <-- A --> E --> F --> G

Now, think like a "good" 3 way merge. When B became C, some lines were 
added, so they are present in D. The same lines were added when going 
from E to F (because of the "cherry-picking" we did above), so they are 
in F. Great. No conflict to resolve. B--> was _not_ "applied twice". In 
contrast, if lines were added in C-->D, they are not present in F, so 
they will be added to G. So "only" C-->D is "really" applied, as 
expected.

The same principle holds for any other operation. As long as the 3-way 
merge is "good", there's simply no problem!

Now, was it Knuth that said, "I formally proved this program works, but 
I haven't actually tested it, so beware"? I feel the same way here. It 
_should_ work, but I haven't tried it. I'd love to, though. All it 
would take is adding:

 monotone patch --from <revision> --to <revision>

Which does a 3-way merge using "from revision" as a base, merging "to 
revision" and this one, just like a normal merge does. In fact, the 
implementation should be very similar to that of "monotone merge"...

BTW, note that the above works even if the "to" version is not derived 
from the "from" version. Not that this makes much sense... but it 
works. Usefulness aside, you could claim that Monotone allows you to 
"cherry-pick" stuff that even Darcs can't, which is quite an 
achivement :-)

2. File renames.

File renames are by far the most annoying part of having to deal with a 
version control system. If I rename a "foo" variable to "bar", I don't 
have to tell the version control about it. But if I rename a file, I do 
need to. Ugh.

What we came up with was the idea that each file has a unique id, a 
randomly generated 128-bit (or whatever) string that - and this is the 
crucial point - could be discovered from the file, somehow.

Most files in a version control system are source files. And source 
files have comments. And it is trivial to say that, if a file contains 
the pattern, say, address@hidden&*, then it has the unique id 
"alphabet-soup". And given a small utility that generates these 
strings, it is easy to embed a UID in new source files.

This way, if you remove or rename the file, the system will be able to 
know about it automatically, even if you edited the file in the new 
position (which you often do). This covers 90% of the cases. If a file 
can't have a UID comment - say it is a 3D studio MAX model, or a .doc 
file or whatever - then you resort to plan B.

Plan B is keeping, "near" the source file, a file saying which UID it 
has. This uid file should be in the same directory as the source file 
(either a ".monotone.uids" file, or .monotone/<file-name>.uid, or 
whatever). You can also use this file as a cache for extracted uids to 
prevent re-scanning of all source files. This way, if you rename a 
whole directory, the system will automatically detect it. And if you 
rename a single file, it will notice you removed it (there's file-less 
uid), and that a new uid-less file was created, so it can prompt to ask 
you whether you actually moved the file. Add a quick size comparison 
plus a diff to estimate "moving likelihhod" to make educated guessed, 
provide that as a default to the prompt, and you get "auto-magical" 
move detection.

Obviously, the manifest file will also contain the uid->path mapping, 
like today (including the SHA signature, that's still important). And 
yes, you also need a "I hereby tell you I moved a file" command. Which 
would only be used once every blue moon, if that.

Same goes for file removal, of course. As for file adding, the use of 
regexps patterns on file names goes a long way: Use the regexps to 
prevent scanning .o, .so and .a files and complain about .c files 
without a uid. Any file which does contain the UID pattern is 
automatically confirmed as a source file, and so on. Again, you also 
need a "I hereby tell you this is/not a source file", for the 0.01% 
cases you need it.

There's another, bonus advantage hidden here. Remember the 
cherry-picking problem? To solve it, we had to do a 3-way merge. And to 
do that, we had to know, for each file in my version, what is the name 
of this file in the two (from and to) patch versions. Currently in 
Monotone, this requires back-tracking from my version all the way back 
to the common ancestor, then all the way forward to the patched 
versions, because the SHA of the file could change in any step along 
the way. But if each file had a UID, this work would have been saved; 
the system would "immediately" know where each file is without having 
to bother with the common ancestor at all. For large projects, the 
saving could be significant.

Admittedly this isn't as "pure" as using SHA for a UID. But it is 
_important_ - it increases usability by an order of magnitude. 
Suddenly, the system is "telepathic". It "just knows" what you did, and 
if it isn't 100% sure it makes very good guesses, so all you have to 
say to it is "yes, you are right". This _will_ make a difference in 
people's willingness to use the system.

3. Merge tracking (a rather minor issue).

Like Monotone, we viewed the history as a DAG. Work always begins with 
some revision, so you get one edge for free. You get another edge 
leading to your new revision every time you merge it with some other 
revision. So far is common wisdom; I _think_ this is the way Monotone 
does it, although the on-line documentation is silent on the issue.

There's a subtlety involved, however, which allows keeping the DAG 
cleaner. The simplest scenario is this: Given revision A, I start 
working on revision B. I take my sweet time completing it; in the mean 
while, someone also start with A and commits revision C. If I check in 
B now, the graph will look like this:

    A --> B
      \-> C

However, suppose I merge with revision C _before_ I commit B. The graph 
will look like this:

    A --> C -\
      \-------> B

What we thought was that this extra edge from A to B in the above graph 
is redundant; it should not be shown. The general rule is: when an edge 
arrives from revision M, supress any edge arriving from an "ancestor" 
of revision M. In the above case, A is an ancestor of C, so when B 
merges with C, the graph becomes:

    A --> C --> B

Which is a cleaner and arguably more accurate description of the 
"geneology" of revision B. Once you merge C, B really become _its_ 
child, and not a child of the long forgotten A revision, regardless of 
the fact that work originally started with A.

It still makes some sense to keep track of the actual merge operations 
in the database, for the sake of tracking who did what when; but other 
than that, they serve no functional purpose. We suspected that keeping 
track of which edges were "real" would also help in some algorithms, by 
ensuring that any "fork" in the graph is a true rather than a "fake" 
fork.

How often this scenario creeps up depends on the work habits. If one 
always commits before a merge (creating revision D), this never 
happens, as the graph would always look like:

    A --> D -\
      \-> C --> B

Which is a "true" fork. But if one only commits after merging all new 
commits first (not unreasonable for a group of programmers working in 
the same office), the graph simplification could be significant.

Well, that's it... I hope you find the above useful, or at least 
coherent :-) Again, I think Monotone is on the right track, and had I 
the time I would gladly try hacking directly at the code instead of 
writing rambling posts. But I have to get back to my own project (YAML, 
if anyone is interested)...

Have fun,

 Oren Ben-Kiki




reply via email to

[Prev in Thread] Current Thread [Next in Thread]