Re: Proposal to fix CVS binary file implementation

From: David H. Thornley
Subject: Re: Proposal to fix CVS binary file implementation
Date: Thu, 28 Dec 2000 15:22:36 -0600

"Greg A. Woods" wrote:
> [ On Thursday, December 21, 2000 at 00:22:41 (-0600), David L. Martin wrote: ]
> > Subject: Proposal to fix CVS binary file implementation
> >
> > A binary file is a binary file.  Period.  It can contain any sequence
> > of characters, the encoding and interpretation of which is solely
> > understood by the application that created it.  CVS should do nothing
> > in terms of keyword expansion or End Of Line conversion to modify a
> > binary file.  Yet, because of the behavior referenced below, CVS does
> > modify binary files under certain conditions, namely when attempting
> > to perform a merge while disabling keyword expansion, where the
> > *intent* is to apply -kk only to non-binary files.  Merging would work
> > much better if CVS did not override the local keyword expansion mode
> > for an archive that has been defined to contain binary data (-kb),
> > regardless of what the user requested on the command line (e.g. cvs
> > update -kk).
> First off, let's get one thing straight here.  Merging of a binary file
> is, by definition, impossible.  I.e. merging can *never* work for binary
> files, regardless of what keyword expansion is attempted.  That is the
> very definition of "binary-ness" -- the file's internal structure is
> opaque and no merging tool can successfully do anything logical to
> retain changes between two variants of the ancestor revision.
No, the definition of being binary is that the usual text-file
conversions should not be performed.  It is possible to have
mergeable binary files, although obviously the usual CVS merge
won't work.  A small modification to the RCS format would allow
a file type to be specified (other than "binary" and "not binary")
and a modification to CVS could allow different merge algorithms
to be selected, based on file type.
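
To make that concrete, here is a rough sketch of choosing a merge routine by file type. It is not anything CVS or RCS does today: the type names ("opaque", "record-set"), the record-per-line format, and every function name below are invented for illustration.

```python
from typing import Callable, Dict

Merger = Callable[[bytes, bytes, bytes], bytes]


class MergeConflict(Exception):
    """Raised when a merge routine cannot reconcile both sides."""


def merge_opaque(base: bytes, ours: bytes, theirs: bytes) -> bytes:
    # Whole-file rule for files with no known structure:
    # keep whichever side changed, and give up if both did.
    if ours == theirs or theirs == base:
        return ours
    if ours == base:
        return theirs
    raise MergeConflict("both sides changed an opaque file")


def merge_record_set(base: bytes, ours: bytes, theirs: bytes) -> bytes:
    # Hypothetical format: independent records, one per line, order irrelevant.
    b, o, t = (set(x.splitlines()) for x in (base, ours, theirs))
    # Keep what both sides still have, plus whatever either side added.
    merged = (o & t) | (o - b) | (t - b)
    return b"".join(record + b"\n" for record in sorted(merged))


# Hypothetical type names; today's RCS files only record a keyword mode.
MERGERS: Dict[str, Merger] = {
    "opaque": merge_opaque,
    "record-set": merge_record_set,
}


def merge(file_type: str, base: bytes, ours: bytes, theirs: bytes) -> bytes:
    # Dispatch on the recorded file type instead of assuming line-based text.
    return MERGERS[file_type](base, ours, theirs)
```

The point of the table is only that a non-text file need not be unmergeable: the "record-set" routine merges cleanly where a line-oriented merge would be the wrong tool, while "opaque" falls back to picking one copy or reporting a conflict.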

If, by "definition", you mean a definition in the universe described
by CVS up until now, you're right, but that's not the definition
of binary for a different definition of definition.  (Clear?)

Since CVS has internalized the RCS handling, and no longer relies
on an external RCS, then it is possible for CVS to modify the RCS
format in a backward-compatible way (i.e., CVS 1.12 could use
CVS 1.10.8 files, but not the reverse).

> >  If the user wants to change the "binary-ness" of a file, this should
> > be performed using cvs admin, but never "on-the-fly" based on options
> > to checkout or update.
> Well, it's not quite that simple.  A file of the same name may have very
> different content on different branches, and if binary files were to be
> supported in some way then it would be necessary to think of the
> situation where a file might be a binary on one branch, but not on
> another, for example.

It would not be necessary to allow such behavior, though.  Consider
CVS as a counterexample.  It supports binary files in some way (not
a very good way), and cannot allow one branch to be binary and one not.

> Indeed a "change" that might be checked in might
> convert the file from binary to non-binary, or vice versa.  I.e. in an
> ideal theoretical world, without constraints, the binariness of a file
> would be defined in the repository on a per-revision basis and could be
> instantly changed by "cvs checkin", which would imply that by default it
> must also be changed in the working directory by "cvs checkout" too.
If we were to change CVS accordingly, yes, that would be one of the
ideal changes.

> Unfortunately the current implementation using an RCS-format repository
> simply cannot possibly ever manage to represent this kind of
> per-revision file state.  RCS "keyword" flags apply to all revisions and
> all branches simultaneously.  The very notion of "per-file flags" is
> bogus and useless for any purpose in a revision control system, but
> that's what we've got to live with so long as we have RCS-format files
> for the repository.

Now, why do we have to have RCS-format files?  Unless we're willing
to give up backward compatibility, we have to support RCS-format
files, but there's no obvious reason we can't support an enhanced
RCS format.

> This is obvious.  What's not obvious is how to deal with all of the
> other issues of binary file handling.  Since CVS does not now, and
> cannot by design, properly handle binary files, the tricks of handling
> keyword expansion may as well take precedence over any concept of
> binariness.
"Cannot by design?"  CVS was not designed to deliberately exclude
binary files.  On the contrary, it has a kludge or two to allow
their use, in a half-hearted sort of way.  The design constraints
make it difficult, but not obviously impossible.
> > My CVS Christmas wish: Can we (CVS users/developers) come to a
> > consensus to devise a fix to allow keyword expansion, binary files,
> > and merging to work harmoniously "out of the box" (e.g. in a way that
> > will make it into the main CVS code line?)  I believe we have a lot of
> > developers and CVS administrators implementing a variety of
> > workarounds.  I know this has been brought up several times in the
> > past and has resulted in many a flame war.
> Well the only way to make CVS allow binary files and merging to work
> harmoniously will be to change the fundamental laws of the universe and
> introduce some real magic into the world!  :-)

It can be done, as per your next paragraph:

> Seriously the only way this can ever work is to give up on having a
> strictly RCS-based repository format.  If you're willing to throw away
> at least part of RCS for the back-end repository, and if you're willing
> to re-implement CVS to use a much more sophisticated database design
> that takes into account the far more complex requirements of handling
> binary data, then sure, you could do this.

That depends on how much more sophisticated and complex you want it.

Suppose, for example, that we were to expand type information in
the RCS file, so that there were multiple types, and that these
types could vary between branches.  This isn't "much more
sophisticated".  We could make CVS select different merge algorithms,
based on type.  Suddenly we can have some of this stuff work out
of the box.

It still could be clunky.  Suppose the type information was defined
based on revision number prefix:  then the way to change a type
would be to create a branch and make the change with the first
commit to the new branch.
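
A small sketch of that lookup, assuming the type table is keyed by dotted revision-number prefix (so "1.4.2" covers every revision on that branch). The table layout, the type names, and the function name are all hypothetical; nothing like this exists in the RCS format as it stands.

```python
from typing import Dict


def type_for_revision(revision: str, prefix_types: Dict[str, str],
                      default: str = "text") -> str:
    # Compare whole dotted components, longest prefix first, so that
    # "1.4.2" matches "1.4.2.1" but not "1.4.20.1".
    parts = revision.split(".")
    for n in range(len(parts), 0, -1):
        prefix = ".".join(parts[:n])
        if prefix in prefix_types:
            return prefix_types[prefix]
    return default


# Hypothetical table: the branch numbered 1.4.2 was declared binary.
types = {"1.4.2": "binary"}
print(type_for_revision("1.4.2.1", types))  # a revision on that branch
print(type_for_revision("1.5", types))      # a later trunk revision
```

Revisions on the trunk and on other branches fall through to the default, which is why changing a type would mean creating a branch first.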

And what are the far more complex requirements?  For binary data,
you generally don't want to do keyword expansion or line-ending
conversions, and you need to specify which diff and merge you're
using.  That's not all that complex.
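
Written down as a record, those requirements might look like the sketch below. The field names and the tool names are made up for illustration, and only the line-ending rule is actually implemented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileTypePolicy:
    expand_keywords: bool
    convert_line_endings: bool
    diff_tool: str   # hypothetical name of the diff routine to use
    merge_tool: str  # hypothetical name of the merge routine to use


TEXT = FileTypePolicy(True, True, "line-diff", "line-merge")
BINARY = FileTypePolicy(False, False, "byte-compare", "pick-one")


def to_working_copy(data: bytes, policy: FileTypePolicy,
                    eol: bytes = b"\r\n") -> bytes:
    # Only types that allow it get LF translated to the client's line ending.
    if policy.convert_line_endings:
        return data.replace(b"\n", eol)
    return data
```

Four fields per file type is the whole of it, at least for the modest level of support being discussed.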

> You might as well start
> completely from scratch and simply attempt to retain the same
> command-line interface and perhaps some rudimentary backward
> compatibility with the client/server protocol such that older clients
> can still do basic text-only operations against a new server.
Here's where I disagree.  It could be worthwhile expanding the RCS
format to do some better handling of binary files.  It would be
possible to improve the handling of binary files while keeping
most of the code base.  It may be better to start from scratch
(in which case I think I'd change the interface also), but not

> > I think it's time for us to close the loop and implement binary file
> > support in a manner which is more merge-friendly, one which
> > accommodates both ASCII and binary files in the same merge operation
> > (where merging of binary files results in *copies* being made and no
> > actual merging - with no binary file keyword expansion or EOL
> > translation).
> Who makes the choice of which "copy" survives?  How is this choice
> reversed if the original decision is incorrect?
You could ask exactly the same questions of text files - say,
C source files (which is exactly what CVS was designed around).
If I merge a program file from a branch to another branch, and there
are conflicts, CVS asks me to resolve them.  If I screw up, well,
we've got version control here, and I can always find the
previous versions to correct a mistake.

> > Here's what I would propose (and I underscore *propose*):
> >[ ... ]
> >      OR
> >
> >      b) Change the current behavior of update and checkout to never
> >      override the archive-stored default keyword substitution mode for
> >      binary files.
> >
> I believe your proposal is somewhat naive in that it does not address any
> of the main issues of trying to manage binary files in a revision
> control system that's specifically designed to allow for concurrent
> editing.
No, this proposal is not for full handling of binary files with
concurrent editing.  It is to allow slightly better handling of
binary files while using CVS.
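
The rule in option (b) is small enough to state as code. This is a sketch of the proposed behavior, not of what CVS currently does, and the function name is invented; the mode strings are the usual -k options.

```python
from typing import Optional

BINARY_MODE = "-kb"


def effective_keyword_mode(stored: str, requested: Optional[str]) -> str:
    # A file stored as binary keeps -kb no matter what was requested.
    if stored == BINARY_MODE:
        return BINARY_MODE
    # Otherwise a command-line mode, if any, overrides the stored default.
    return requested if requested is not None else stored
```

With this rule, an update with -kk across a mixed tree would collapse keywords in the text files and leave every -kb file untouched.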

It might be useful to have the ability to automatically set a
"cvs watch" on binary files, to take advantage of CVS's answer
to file locking.
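
One way this could be wired up, as a sketch only: given each file's stored keyword mode (how that mapping is obtained is left open here, and the function name is hypothetical), build a "cvs watch on" command for every file marked -kb.

```python
from typing import Dict, List


def watch_commands(stored_modes: Dict[str, str]) -> List[List[str]]:
    # One "cvs watch on <file>" per file whose stored mode is -kb.
    return [
        ["cvs", "watch", "on", path]
        for path, mode in sorted(stored_modes.items())
        if mode == "-kb"
    ]
```

Each resulting list is an argument vector that could be handed to something like subprocess.run from inside a working directory; the sketch itself runs nothing.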

> So, OK, you're willing to work around these issues with CVS to try to
> maintain some semblance of concurrent editing support.  Perhaps you're
> even willing to use the "cvs edit" hack and some externally imposed
> procedures and processes to prevent your users from concurrently editing
> binary files.

This seems to be fairly common, like it or not (and I know you don't
like it).  In my experience, "cvs watch" and "cvs edit" do everything
that's needed to handle the development of nonmergeable files.

> Now what about the scenario when you go to merge two branches together
> and there are conflicting changes in binary files, but where both
> changes must be retained?  Suddenly your difficulties are twice as large
> and twice as hard to fix.
Twice as large and hard as what?  You've got the choice of using
one binary or the other, or creating a new one.  Is there a
better system around to do this?

> What about the scenario where your repository has been around for a
> while and you find that users are beginning to want to re-use
> now-removed filenames, but with different attributes (eg. suddenly a
> file becomes a binary)?
That happens now.  I tell the users that it isn't going to work.

> The can of worms opened up by binary file support just gets deeper and
> wider the more you look at it.
Not unless you insist on being able to do everything with binary
files that you can with program source files, and in that case
I don't think you'll ever be satisfied.  If you're willing to
make a few compromises, then there can be greater or lesser
support for binary files without changing all that much.

> While CVS as it stands has several features which make it generically
> attractive for general-purpose revision control, it cannot be stated too
> many times that CVS is *NOT* a general-purpose revision control system
> -- it is specifically a system *DESIGNED* to handle the special case of
> file formats which can easily be merged automatically with simple
> unix-style diff; and which as a result means it can specifically target
> the needs of those who must work in environments where concurrent
> editing must be allowed and encouraged.  This DESIGN implies that it has
> constraints on its operation which prevent it from being a truly general
> purpose tool.
Unless you count language systems, there is no such thing as a
truly general purpose tool, and even with them we use special-purpose
tools for certain purposes.  There is no reason inherent in the
design why CVS could not be made to handle other needs, to a
greater or lesser extent.  In some cases, this would involve
undesirable code bloat for no proportionate gain, but that's
an argument to make on a case-by-case basis.

It is unreasonable to say about a tool, "It does everything I want
it to do, and therefore should do nothing more."  It is reasonable
to argue individual issues ("Should your text editor contain
a text adventure and newsreader?"), but that is precisely what
you are not doing in that paragraph.

> Therefore if you do not like the DESIGN of CVS, and by definition the
> constraints it imposes on the resulting product, then DO NOT USE CVS,
> regardless of whatever other features it has which might make it
> attractive to you!
I carry a Swiss army knife in my pocket.  It has numerous design
constraints that effectively make sure it is not a great tool for
anything.  However, it will do for many purposes requiring a tool,
and carrying it means that I frequently have a mediocre tool for
the job right at hand, which can be much preferable to having the
right tool in the bottom of the toolbox in another room.  By
your reasoning, since it isn't the ideal tool for anything I
shouldn't carry it.

> I.e. if you want to handle binary files in a revision control
> environment then I strongly suggest that you'll be much further ahead if
> you simply take the ideas you like from CVS and start from scratch with
> a new design for a revision control system that is capable of handling
> the binary files you seem to need to handle.
Um, no.

If the CVS code base will do what somebody wants to do with some
modification, then I suggest that that person will be much better
off modifying the existing code rather than starting to build
a version control system from scratch.  I'd hoped we'd gotten
beyond the "everybody makes their own tools" idea long ago.

> Of course if you throw away the silly idea of trying to support binary
> files in CVS with an RCS-format repository, and instead focus on
> extending CVS and the RCS file format definition it uses such that a
> file type can be specified on a per-revision (or at least per-branch)
> basis.  Also devise an extension that allows deltas to be defined with
> byte or (multi-byte) character offsets instead of line offsets.  You can
> then design tools which can do logical difference comparisons of
> variants and merges of changes with specific knowledge of these file
> types.  THEN you'll have a more powerful revision control system that
> can simultaneously handle changes to many file types in an intelligent
> manner.  Such a system could even do intelligent comparisons of
> text-based source files such that changes would be recognized on a
> code-structure level instead of on a text-line level as it is today.
> This is obviously a more intensive redesign, but one which will be
> infinitely more productive than any attempt to handle binary files in
> any way whatsoever.
And here I am puzzled.

You are advocating taking CVS as it is, and extending it to
handle binary files in a much more powerful way.  In the last
paragraph, you said to throw CVS away if the previous poster
didn't like some of its design decisions.

Now, this is a more ambitious project than some that people were
suggesting, but it can be approached incrementally.  If the RCS
format is extended to allow specifying file types on a per-branch
basis, this fits into your suggestion and allows better handling
of binary files.

David H. Thornley                          Software Engineer
at CES International, Inc.:  address@hidden or (763)-694-2556
at home: (612)-623-0552 or address@hidden or
