Re: Proposal to fix CVS binary file implementation
Greg A. Woods
Thu, 28 Dec 2000 14:12:06 -0500 (EST)
[ On Thursday, December 21, 2000 at 00:22:41 (-0600), David L. Martin wrote: ]
> Subject: Proposal to fix CVS binary file implementation
> A binary file is a binary file. Period. It can contain any sequence
> of characters, the encoding and interpretation of which is solely
> understood by the application that created it. CVS should do nothing
> in terms of keyword expansion or End Of Line conversion to modify a
> binary file. Yet, because of the behavior referenced below, CVS does
> modify binary files under certain conditions, namely when attempting
> to perform a merge while disabling keyword expansion, where the
> *intent* is to apply -kk only to non-binary files. Merging would work
> much better if CVS did not override the local keyword expansion mode
> for an archive that has been defined to contain binary data (-kb),
> regardless of what the user requested on the command line (e.g. cvs
> update -kk).
First off, let's get one thing straight here. Merging of a binary file
is, by definition, impossible. I.e. merging can *never* work for binary
files, regardless of what keyword expansion is attempted. That is the
very definition of "binary-ness" -- the file's internal structure is
opaque and no merging tool can successfully do anything logical to
retain changes between two variants of the ancestor revision.
Not until the very end of your message do you hint that you understand
"merging of binaries" to mean "selecting between alternate revisions".
> If the user wants to change the "binary-ness" of a file, this should
> be performed using cvs admin, but never "on-the-fly" based on options
> to checkout or update.
Well, it's not quite that simple. A file of the same name may have very
different content on different branches, and if binary files were to be
supported in some way then it would be necessary to think of the
situation where a file might be a binary on one branch, but not on
another, for example. Indeed a "change" that might be checked in might
convert the file from binary to non-binary, or vice versa. I.e. in an
ideal theoretical world, without constraints, the binariness of a file
would be defined in the repository on a per-revision basis and could be
instantly changed by "cvs checkin", which would imply that by default it
must also be changed in the working directory by "cvs checkout" too.
Unfortunately the current implementation using an RCS-format repository
simply cannot possibly ever manage to represent this kind of
per-revision file state. RCS "keyword" flags apply to all revisions and
all branches simultaneously. The very notion of "per-file flags" is
bogus and useless for any purpose in a revision control system, but
that's what we've got to live with so long as we have RCS-format files
for the repository.
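
To make that limitation concrete, here is a minimal Python sketch (it is
not CVS or RCS code) contrasting a single per-file expansion flag with a
hypothetical per-revision one. The class and method names are invented
for illustration; the only thing taken from the discussion above is that
an RCS-format file carries one expansion mode for all of its revisions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PerFileArchive:
    """Models the RCS-style situation: one expansion mode for the whole file."""
    expand: str = "kv"
    revisions: Dict[str, bytes] = field(default_factory=dict)

    def mode_for(self, rev: str) -> str:
        # Every revision on every branch sees the same flag.
        return self.expand


@dataclass
class PerRevisionArchive:
    """Hypothetical alternative: the mode is recorded with each revision."""
    default_expand: str = "kv"
    revisions: Dict[str, bytes] = field(default_factory=dict)
    modes: Dict[str, str] = field(default_factory=dict)

    def commit(self, rev: str, data: bytes, expand: Optional[str] = None) -> None:
        self.revisions[rev] = data
        if expand is not None:
            self.modes[rev] = expand

    def mode_for(self, rev: str) -> str:
        return self.modes.get(rev, self.default_expand)


per_file = PerFileArchive()
per_file.revisions["1.1"] = b"plain text\n"
per_file.revisions["1.2"] = b"\x89PNG\r\n"
per_file.expand = "b"   # flipping the flag also changes how 1.1 is treated

per_rev = PerRevisionArchive()
per_rev.commit("1.1", b"plain text\n")
per_rev.commit("1.2", b"\x89PNG\r\n", expand="b")
```

With the per-file model, marking the file binary for revision 1.2 also
reclassifies the older text revision 1.1; the per-revision model keeps
the two apart.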
> We need to decouple, in concept and in implementation, "binary" and
> "keyword expansion mode". The binary nature of a file (which mandates
> no EOL translation and no keyword substitution) is an immutable
> attribute of the file which must always take precedence. It should
> not be adjustable using checkout or update. I cannot think of any
> circumstance where a binary file would ever need to be transiently
> defined as anything different.
This is obvious. What's not obvious is how to deal with all of the
other issues of binary file handling. Since CVS does not now, and
cannot by design, properly handle binary files, the tricks of handling
keyword expansion may as well take precedence over any concept of
> My CVS Christmas wish: Can we (CVS users/developers) come to a
> consensus to devise a fix to allow keyword expansion, binary files,
> and merging to work harmoniously "out of the box" (e.g. in a way that
> will make it into the main CVS code line?) I believe we have a lot of
> developers and CVS administrators implementing a variety of
> workarounds. I know this has been brought up several times in the
> past and has resulted in many a flame war.
Well the only way to make CVS allow binary files and merging to work
harmoniously will be to change the fundamental laws of the universe and
introduce some real magic into the world! :-)
Seriously the only way this can ever work is to give up on having a
strictly RCS-based repository format. If you're willing to throw away
at least part of RCS for the back-end repository, and if you're willing
to re-implement CVS to use a much more sophisticated database design
that takes into account the far more complex requirements of handling
binary data, then sure, you could do this. You might as well start
completely from scratch and simply attempt to retain the same
command-line interface and perhaps some rudimentary backward
compatibility with the client/server protocol such that older clients
can still do basic text-only operations against a new server.
> A frequent argument against changing is that CVS was not designed to
> handle binary files. This may be true, but the introduction of the
> -kb option tends to prove a willingness and desire in the CVS
> development and user community to accommodate binary files.
That's a completely invalid argument -- your conclusion is invalid. The
introduction of '-kb' comes from RCS, not CVS, and in no way implies
that CVS can make any more than the most rudimentary use of it. Just
because RCS has a feature doesn't mean CVS uses it in the way you might
logically conclude from its use in RCS (eg. branching, locking, keyword
expansion, state fields, etc., etc., etc.).
> Many (myself included) have implemented procedures or written scripts
> to effectively exclude binary files from the merge operation, or to
> perform some pre- or post-processing on the files or archives used in
> the merge to correct the problems encountered using cvs update -kk.
Either you have a different concept of merging, or you have not done
exactly as you claim you have.
> Others may construct their repositories so that binary files live in
> their own directories in a sort of "binary prison", apart from the
> ASCII source files, so that the binary files may be more easily
> excluded from merge operations. I don't think this is a good solution
> because CVS then dictates repository structure, even when cohesive
> functional grouping may dictate that ASCII and binary files should
> coexist in the same directory.
Well it's the only logical solution given the constraints of the tools
at hand... The effect of this solution on repository structure is
nowhere near as important as you seem to make it out to be.
> I think it's time for us to close the loop and implement binary file
> support in a manner which is more merge-friendly, one which
> accommodates both ASCII and binary files in the same merge operation
> (where merging of binary files results in *copies* being made and no
> actual merging - with no binary file keyword expansion or EOL
Who makes the choice of which "copy" survives? How is this choice
reversed if the original decision is incorrect?
> Here's what I would propose (and I underscore *propose*):
> 1) Maintain the current keyword expansion modes, as persisted in the
> archive or in the local working area in the Entries file as "kv, kvl,
> k, o, b, or v";
> 2) EITHER:
> a) Provide a new command line keyword expansion option "-km" on
> cvs update and cvs checkout to support merging. The effect would
> be that the working area local keyword substitution mode would be
> overridden to "k" for all but binary files, which would remain
> b) Change the current behavior of update and checkout to never
> override the archive-stored default keyword substitution mode for
> binary files.
> Any comments? Wait, let me put on my Kevlar heat-resistant suit
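
For reference, the two proposed rules can be written down as a small
Python sketch. This is only an illustration of the quoted proposal, not
CVS code: the function names are made up, "m" stands for the proposed
(non-existent) -km option, and the sketch assumes that binary files under
option 2(a) simply keep their stored "b" mode.

```python
from typing import Optional

BINARY = "b"


def mode_option_2a(stored: str, requested: Optional[str]) -> str:
    """Option 2(a): a new '-km' request means 'k' for everything but binaries."""
    if requested is None:
        return stored
    if requested == "m":
        # Assumption: binary files keep their stored mode under -km.
        return stored if stored == BINARY else "k"
    # Any other -k option keeps the override behaviour described above.
    return requested


def mode_option_2b(stored: str, requested: Optional[str]) -> str:
    """Option 2(b): a command-line -k option never overrides a stored 'b'."""
    if stored == BINARY or requested is None:
        return stored
    return requested
```

Under 2(a) a plain -kk still clobbers a binary file; only the new option
is safe. Under 2(b) the stored "b" always wins, whatever is requested.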
I believe your proposal is somewhat naive in that it does not address any
of the main issues of trying to manage binary files in a revision
control system that's specifically designed to allow for concurrent
Think about it: You've got some changes you are about to commit that
include changes to a file which you've tagged as un-mergable (i.e. it is
a binary, opaque, file). As you run "cvs commit" you discover that
someone else has simultaneously made changes to that file. Now what?
You can't even use "cvs diff" to find out what the heck they did! You
can only guess by investigating their revision comments and/or by asking
them out-of-band. If the file has some structure that's visible in some
other medium than a text editor (eg. it's a JPEG) then you can perhaps
visually compare your revision, the ancestor revision, and the other
person's new revision.
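
As a rough illustration of how little a tool can do here, the following
Python sketch treats the three versions as opaque byte strings. The
function name and return convention are hypothetical; the point is only
that the tool can pick a whole copy when one side is unchanged, and has
nothing to offer when both sides differ from the ancestor.

```python
from typing import Optional


def resolve_opaque(ancestor: bytes, mine: bytes, theirs: bytes) -> Optional[bytes]:
    """Pick a whole copy of an opaque file, or return None on a real conflict."""
    if mine == theirs:
        return mine        # both sides ended up identical
    if mine == ancestor:
        return theirs      # only the other person changed it
    if theirs == ancestor:
        return mine        # only I changed it
    return None            # both changed it differently: a human must choose


base = b"\x00\x01\x02"
conflict = resolve_opaque(base, b"\x00\x01\xff", b"\x00\xee\x02")
```

In the conflicting case the result is None: there is no way to combine
the two sets of changes, only to select one copy and discard the other.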
So, OK, you're willing to work around these issues with CVS to try to
maintain some semblance of concurrent editing support. Perhaps you're
even willing to use the "cvs edit" hack and some externally imposed
procedures and processes to prevent your users from concurrently editing
Now what about the scenario when you go to merge two branches together
and there are conflicting changes in binary files, but where both
changes must be retained? Suddenly your difficulties are twice as large
and twice as hard to fix.
What about the scenario where your repository has been around for a
while and you find that users are beginning to want to re-use
now-removed filenames, but with different attributes (eg. suddenly a
file becomes a binary)?
The can of worms opened up by binary file support just gets deeper and
wider the more you look at it.
While CVS as it stands has several features which make it generically
attractive for general-purpose revision control, it cannot be stated too
many times that CVS is *NOT* a general-purpose revision control system
-- it is specifically a system *DESIGNED* to handle the special case of
file formats which can easily be merged automatically with simple
unix-style diff; and which as a result means it can specifically target
the needs of those who must work in environments where concurrent
editing must be allowed and encouraged. This DESIGN implies that it has
constraints on its operation which prevent it from being a truly general
Therefore if you do not like the DESIGN of CVS, and by definition the
constraints it imposes on the resulting product, then DO NOT USE CVS,
regardless of whatever other features it has which might make it
attractive to you!
I.e. if you want to handle binary files in a revision control
environment then I strongly suggest that you'll be much further ahead if
you simply take the ideas you like from CVS and start from scratch with
a new design for a revision control system that is capable of handling
the binary files you seem to need to handle.
Of course you could throw away the silly idea of trying to support binary
files in CVS with an RCS-format repository, and instead focus on
extending CVS and the RCS file format definition it uses such that a
file type can be specified on a per-revision (or at least per-branch)
basis. You could also devise an extension that allows deltas to be
defined with byte or (multi-byte) character offsets instead of line
offsets. You can
then design tools which can do logical difference comparisons of
variants and merges of changes with specific knowledge of these file
types. THEN you'll have a more powerful revision control system that
can simultaneously handle changes to many file types in an intelligent
manner. Such a system could even do intelligent comparisons of
text-based source files such that changes would be recognized on a
code-structure level instead of on a text-line level as it is today.
This is obviously a more intensive redesign, but one which will be
infinitely more productive than any attempt to handle binary files in
any way whatsoever.
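
A delta keyed on byte offsets rather than line offsets might look
something like the Python sketch below. It is an assumption-laden toy,
not a proposed RCS format: the Delta layout and both function names are
invented here, and the standard library's difflib.SequenceMatcher is
used only as a convenient way to find the changed byte ranges.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Each entry: (start offset in old, end offset in old, replacement bytes).
Delta = List[Tuple[int, int, bytes]]


def byte_delta(old: bytes, new: bytes) -> Delta:
    """Describe how to turn old into new using byte offsets, not lines."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_delta(old: bytes, delta: Delta) -> bytes:
    """Rebuild the new revision from the old one plus a byte-offset delta."""
    out = bytearray()
    pos = 0
    for start, end, replacement in delta:
        out += old[pos:start]   # copy the unchanged bytes before this edit
        out += replacement      # insert the new bytes
        pos = end               # skip the bytes that were replaced or deleted
    out += old[pos:]
    return bytes(out)


old_rev = b"\x00\x01\x02\x03header\xff"
new_rev = b"\x00\x01\x09\x03header\xff\xfe"
rebuilt = apply_delta(old_rev, byte_delta(old_rev, new_rev))
```

This only shows that such deltas can be stored and replayed; deciding
whether two byte-level changes can be merged sensibly would still need
the type-specific tools described above.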
(BTW, it would be really nice if you could post with either an ASCII or
ISO-8859-1 charset, and make sure your paragraphs are nicely wrapped to
fit on an 80-column display, and whatever you do NEVER, EVER, post MIME
multi-part alternative HTML!!!!)
Greg A. Woods
+1 416 218-0098 VE3TCP <address@hidden> <robohack!woods>
Planix, Inc. <address@hidden>; Secrets of the Weird <address@hidden>