[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

ISO version control file merge tool (merge RCS/CVS ,v files)

From: Andy Glew
Subject: ISO version control file merge tool (merge RCS/CVS ,v files)
Date: 22 Dec 2004 17:03:02 -0800
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.1

---++ VC Merge

So here is what I want: given two version control files,
supposedly for the same file,
merge them.

The merge may range from trivial to fancy.
All merges should preserve all history and comments;
however, some merges may do better than others at recognizing commonality
between ostensibly different versions.
Similarly, some merges may be more space efficient than others

---++ Commonalities and Differences between VC Tools

I'll talk about this as if it is a tool that merges CVS/RCS ,v files.
I believe it should generalize fairly well to other VC tools,
but what I need now is CVS. CVS is my legacy VC system.

Pretty much all tools understand
   * files
   * versions of files
   * branches
   * check in comments

Some tools handle groups of files together, but I don't think this matters
to the discussion here.

Some tools don't really have a concept of a branch
- notably Subversion, where a branch is just a different directory
in the Subversion "filesystem".
However, in Subversion you can trace ancestry, and figure out
branching, although it can be hard to distinguish branching in the Subversion 
VC system
from moving a file to a different location in irectory hierarchy.

Some tools understand more complicated graphs than simple branching:
they can track cross-links, branches that merge back, etc.
I think this can be handled.

Some tools get confused if a filename has different types in different versions 
- e.g. CVS, change from a binary to a text file.
Other tools handle this - e.g. MetaCVS, giving a file a unique name at initial 
If you delete and then re-add a file, it gets a new unique name.
Some tools totally separate names from content - e.g. Monotone.

For all tools, it should be possible to write code that traverses the 
tree, handling branches depth-first, bread-first, or whatever.

---+++ Trivial Merge - No Sharing

The most trivial merge replicates all branches, and doesn't try to share 




Output: union of all branches and versions
     |                  +->v1.1.1.1->v1.1.1.2  

This exposes some minor issues:
   * in a CVS-like system, who gets the main branch
      * A: option. Default nobody
   * what about tags and labels
      * Default: all made unique by, e.g. adding a prefix.
      * Advanced: may want to recognize sharability

---+++ Types of Sharing

Same data, same version:
In the above example, repo1/A v1.1, v1.2, may be the same as
repo2/A v1.1 and v1.2.  

Same data, different versions:
repo1/A v1.3 may be the same as repo2/A v1.4.
i.e. the main branch of repo1 may have skipped a version
that was on repo2's main branch.
(It's just accidental that I say main branch for this

---+++ Trivial Merge - Recognize Sharing

A trivial merge might recognize sharing, but not change the trivial
concatenation of all branches.
E.g. it would still have the union of all branches and versions,
but might add comments indicating commonality.

   * fileA
      * repo1branch
         * main branch (v1.*)
            * v1.1
               * v1.1.1.* branch
                  * v1.1.1.1
                  * v1.1.1.2
            * v1.2
            * v1.3
      * repo2branch
         * main branch (v1.*)
            * v1.1
               * same as fileA/repo1branch/v1.2
               * v1.1.1.* branch
                  * v1.1.1.1
                     * same as fileA/repo1branch/v1.1.1.1
            * v1.2
               * same as fileA/repo1branch/v1.2
            * v1.3
            * v1.4
               * same as fileA/repo1branch/v1.3

In this example, I only indicated when files from repo2 were the'same as files 
from repo1;
i.e. I only indicated one direction.
It could be made bidirectional.
These "same as" comments might just be that - comments.
However, they could also be links understood by the version control
tool, to reduce the amount of data stored.

Different version control tools store different amounts of data.
Subversion, for example, at one time stored a full version at the head of a 
This wasted space.
Monotone, however, will only store any given set of bytes once
- Monotone is content based.

Thus, recognizing sharing may save space.
However, more important is what recognizing sharing does to the user perception
of history.

---+++ Common Ancestor Tree Merge

A basic form of tree-oriented merge would not replicate branches
from repositories until there is a difference.
At the point of divergence, it would provide the union of all diverging 

   * fileA
      * main branch (v1.*)
         * v1.1
            * common to both repo1 and repo2
            * v1.1.1.* branch
               * v1.1.1.1
                  * common to both repo1 and repo2
                  * repo1branch
                     * v1.1.1.2
            * v1.2
               * common to both repo1 and repo2
               * repo1branch
                  * v1.3
               * repo2branch
                  * v1.3
                  * v1.4
                     * same as A/repo1branch/v1.3

This is better, but it still leads to excess diverginmg branches.

---+++ Extending Linear Tree

If one repository has a linear tree, and the other repository purely
extends it, we don't need to create a new branch - we can just extend
the branch common to both.

where v1.1-v1.3 are common to both.

We can then extend


adding the comment that v1.4 was found in repo2 but not repo1.

This can obviously be applied all over a tree.

---++++ ISSUE: Multiple Conflicting Linear Extensions

Unfortunately, this approach, applied pair-wise, may not give us the
best result if we want to merge 3 or more repositories:


Say that repo2/.../v1.4 and repo3/.../v1.4 are different.  We may
prefer to get

                                +-> repo2branch -> v1.4
                                +-> repo3branch -> v1.4

rather than 

   repo1->fileA->v1.1->v1.2->v1.3->v1.4 (repo2)
                                +-> repo3-> v1.4

---+++ Skips in a Linear Tree

I'm going to use letters to indicate contents:


i.e. repo1's v1.3 is the same as repo2's v1.4

This might have arisen if repo2 was were development was performed,
and repo2/v1.3 was not checked into repo1.
In which case we probably want the merge to look like

suitably commented.

---++++ Skips indicating Retrospective Branching

However, it might also be that repo2/v1.3 was really a dead-end, and
that it was abandoned, and picked up again.  In which case merging the
branches into a single branch is not what we want.  In fact, we
probably would have liked to have a different original tree structure

suitably commented to indicate that the version marked 
1.3/1.4 was 1.3 in repo1 and 1.4 in repo2.

I call this _retrospective_ _branching_, and it has long been a feature I 
wanted in CVS.
You realize, after you have checked in several versions on the main branch,
that you should really have been off on a task branch.

I don't think the merge tool should be burdened with
retrospective branching.  
However, if we have a framework for manipulating version control files,
then retrospective branching could be handled by a different tool.

If the version control structure itself is changed, however,
then that itself should be version controlled.
It might be necessary to use Monotone like content oriented concepts to
version the versioning itself. See <nop>VersioningTheVersioning below.

Finally, a pattern that I have observed often reveals retrospective branching:
I.e. the change Y' in 1.3 is backed out;
the old version v1.2 containing X is moved to the front of the branch as v1.4;
and then development resumes.

---+++ Fancier Graph Commonalities

Some tools provide cross-links between branches.
CVS does not.
I don't think it's a major difficulty.

If trivial merging without sharing has been done,
the cross-links can be preserved almost verbatim,
except for the usual changes to ensure uniqueness of names.

If a node that was cross-linked to in oe repo was not cross-linked to in
the other, the cross-linking may be open to question,
especially if it has semantic value, such as controlling future merges.
But it can probably still be done, suitably commented.

Some cross-links indicate progress of two branches tracking each other.
This might become confused if, for example, the first
cross-link was on the main branch,
but the second cross-link was on a branch automatically created
by the merge tool when repositories have diverged.
It can probably still be done, suitably commented;
or, it may be necessary to change the type of such links.

Since CVS does not have such links, no worries for now.
Although some people use naming conventions to record such links
in CVS's tags.

---+++ Recognizing Commonality

---++++ Recognizing Version Commonality

Use Monotone like content based filesystem approaches.
Create unique names based on content.
Most generally, this is content, but hash codes are shorter.
Monotone assums hash codes never collide; I prefer to allow for collision,
e.g. by using an instance number after thye hash code.

Using content allows us to find all whole object commality, and 
automatically generate comments for the same.
Having found such maximal commonality, we can then apply sharing algorithms to 
the tree.

PROBLEM: this approach, implemented directly, assumes that all versions of all
files in two repositories that are to be merged are instantiated,
so that the hash codes can be calculated. This may be expensive.

Recording the hash codes in the version contents may help.

If the hash codes can be calculated incrementally, from an old version
and a patch, this may be faster. However, most modern crypto hashes cannot be 
incrementally, by design.

If ancestor versions from two trees have been expanded
and found to be identical (by hashing and possibly with full diff)
then we can use the patches on the version tree to find commonalities
wthout full expansion.
However, when patches diverge, we may not be able to recognize reconvergence.
I.e. following the patch tree helps find much commonality,
but not all.
Content based idenification is necessary to find all full file commonality.

---+++++ Trivial Differences

Things such as RCS/CVS's $<nop>Head<nop>$ and $<nop>Log<nop>$ variables
may produce trivial differences.

Apart from recommending that such things not be used, I do not have 
any great plans to handle them in a merge tool.

Elsewhere and previously, I have discussed how to handle such trivial
differences, in particular for automatically generated text, using
very much the same techniques that I propose to debug macro expansion.
But I don't think it's worth it for a VC merge.

---++++ Recognizing Patch Commonality

If we look at the patches, or diffs between versions on the tree
(typically adjacent versions)
we may be able to recognize some commonalities.

If exactly the same patch has been applied on two branches,
it may be worth noting that in the merge commentary.

However, patches can vary in detail although conceptually identical.
For example, some patches include absolute line numbers;
even context diff patches may use the line number as a starting
point for a search.

I don't think I want to propose that a VC diff tool search for deep
patch commonalities.

---+++ Sharing and Uniqifying Names, etc.

Tag names, branch names, etc., may need to be initially uniqified, and then 
shared in much the same way as versions.

---+++ Re-version-numbering

The merge will almost undoubtedly change some, if not many or all,
version numbers.

The only real question is whether we should seek to change maximally,
or whether we should minimize changes by leaving old versions as they are
if common to all mergees.

---+++ Usage Models

---++++ Merging 2 Largely Related Repositories

I anticipate that the most common usage model will be to
   a. copy the full repository
   b. edit disconnectedly
   c. merge
The mergees can be expected to have much commonality, share most things;
the divergences should be brief.

---++++ Merging An Ad-Hoc Repository

A second usage model might be
   a. copy a file, without VC info
   b. create a new VC repository for that file
   c. edit/checkin cycle
   d. now merge the new VC repository with the original

Here, the old repository will be big,
but the new small.

It would be desirable to figure out where, in the old VC history,
the new one started - and then merge the new VC as a subtree of
the old.

Two separate operations:
   * finding subtree base
   * merge subtree

Merging the subtree is a straightforward modification of what we have described 

Finding the subtree base:
   1. specify manually 
      * e.g. "I checked out v1.2
   2. use content based uniqueness, e.g. hash code,
      to find ancestor anywhere on graph
      * note: often don't check in the base version, just the first change

---+++ Some Desirable Properties

for some settings of command line switches,
the merge should commute.
   vc-merge repo1/fileA,v repo2/fileA,v > 12,v
   vc-merge repo2/fileA,v repo1/fileA,v > 21,v
   is-empty diff 12,v 21,v 
This will not always be true, because of choices of which
repo keeps the man branch by default.
But it should be true in some cases,
e.g. union of all branches.

for some settings of command line switches,
   vc-merge repo1/fileA,v repo2/fileA,v > 12,v
   vc-merge 12,v repo2/fileA,v > 122,v
   is-empty diff 12,v 122,v 
Trivially performing the union of all branches merge will
result in a new subtree being added each time.
This is undesirable.
Recognition of commonality should also recognize when the
same merge has been completely done.

---+++ <nop>VersioningTheVersioning

<nop>VersioningTheVersionControl is thye sort of thing required for
retrospective branching. It is also useful for tracking VC merges.

MetaCVS uses the concept of a map, from VC objects to positions
in the filesystem.

Monotone uses such maps not only for filesystem position, but also
for version control position.

Such maps would allow use to "version the version control".
E.g. the trivial merge might give us

   * fileA
      * repo1branch
         * main branch (v1.*)
            * v1.1
               * v1.1.1.* branch
                  * v1.1.1.1
                  * v1.1.1.2
            * v1.2
            * v1.3
      * repo2branch
         * main branch (v1.*)
            * v1.1
               * same as fileA/repo1branch/v1.2
               * v1.1.1.* branch
                  * v1.1.1.1
                     * same as fileA/repo1branch/v1.1.1.1
            * v1.2
               * same as fileA/repo1branch/v1.2
            * v1.3
            * v1.4
               * same as fileA/repo1branch/v1.3

But applying a commonality sharing algorithm might give us:

   * fileA
      * main branch (v1.*)
         * v1.1
            * common to both repo1 and repo2
            * v1.1.1.* branch
               * v1.1.1.1
                  * common to both repo1 and repo2
                  * repo1branch
                     * v1.1.1.2
            * v1.2
               * common to both repo1 and repo2
               * repo1branch
                  * v1.3
               * repo2branch
                  * v1.3
                  * v1.4
                     * same as A/repo1branch/v1.3

Both of these merges, the trivial and the shared, could be checked in
as versions of the map.

Manual editing of the version control might be acceptable,
e.g. for retrospective branching.

More comprehensive manual editing *might* be desirable.
E.g. fixing typos in version checkin comments.
Possibly deleting, to all appearances, no longer necessary branches;
i.e. pruning the VC history to appear more sensible.
(In my fascist mode I insist that we would never actually
delete versions, the version contents would still appear,
and they would appear in some map. But they would not clutter the default map.)

We might adopt heuristics such as allowing VC map edits that do not
eliminate any files from the map - retrospective branching would be one such.
Higher power would be needed to actually delete.

The concept of "pruning" the map, I think, could be dangerous.
It might lead to trying to keep two maps,
the unpruned and pruned. Such duplication of effort would be bad.
It would be better for such pruning to be accomplished
via conditions:
   * by default display only the versions and branches marked interesting
   * optionally display the full history

Andy Glew
PREFERRED EMAIL: address@hidden
FALLBACK EMAIL: address@hidden


Potential bias: employed now by Intel
        past by AMD, Intel, Motorola, Gould ...
This post is personal, and is not the opinion of 
        any of my employers, past or present.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]