From: Paul Sander
Subject: RE: binary files bad idea? why?
Date: Mon, 19 Jul 2004 09:52:42 -0700

>--- Forwarded mail from Greg Woods:

>[ On Tuesday, July 6, 2004 at 15:18:50 (-0700), Paul Sander wrote: ]
>> Subject: RE: binary files bad idea? why?

>And BTW, you keep waving over RCS compatibility with arguments about
>minimum reproducibility that simply do not wash.  RCS compatibility
>means _all_ of RCS' features, warts, and concepts.  More about the
>reasons for this below.

As long as the rcsfile(5) specification is met, all of RCS' features,
warts, and concepts will follow.  That specification also explicitly
allows extensions to be introduced in particular ways, and RCS is
written to accommodate such extensions by ignoring them, while allowing
other tools that scan RCS files to store their own semantics.  There is
absolutely no reason why such tools should not be written, and any tool
that cannot work in such an environment is inherently broken because it
violates the rcsfile specification.
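
To make that concrete: the rcsfile(5) grammar allows "newphrase"
entries (an identifier followed by words and a semicolon) in several
places, and stock RCS skips over any identifier it doesn't recognize.
A hypothetical extension recording the data type in the admin section
might look like this (the "datatype" keyword is my own invention, not
part of RCS):

   head     1.2;
   access;
   symbols  release-1:1.2;
   locks;   strict;
   comment  @# @;
   datatype @application/msword@;

RCS itself would silently ignore the datatype phrase; a cooperating
tool could read and maintain it.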

>In the mean time, and in the real world, RCS (and thus CVS) and all the
>tools they use and are used with them work _best_ and primarily with
>text files.  I.e. until someone provides working code that makes the
>diff/diff3 toolset used _internally_ in CVS (_and_ RCS) to be selectable
>(on a per-revision basis!!!!), there's no point to even pretending that
>non-text files can be handled generically by CVS.

This is nuts.  Differencing algorithms that rely on longest common
subsequences will remain the best algorithms for storing deltas for a
long time to come, regardless of the type of data being stored.
Applying a different algorithm for every revision just won't work.

>And BTW, the point about "on a per-revision basis" is/was supposed to be
>a strong clue to you to show just how hare-brained and nearly impossible
>to achieve your ideas are.  It's also a _necessary_ requirement for both
>RCS and CVS, which manage _files_ and groups of _files_, but not the
>structure of the grouping.

You keep making the mistake of claiming that the differencing algorithm
that computes the deltas, and the differencing and merge tools that
manipulate and present the content of the revisions are inextricably
linked.  They are not.

However, I agree that in that context, the differencing and merge tools
must be compatible with the content of every revision stored in an RCS
file.  The easiest way to do that is to make sure that every revision
has the same kind of data.  The problem is that CVS doesn't make such
a guarantee at this time.

For a very long time I have argued that CVS should record, in the admin
section of every RCS file, the type of data stored in it, and that CVS
should consult that data type to select the proper tools to provide the
diff and merge capabilities.  This is one way to guarantee that all of
the revisions, and therefore all of the combinations of data fed to the
tools, are indeed compatible.
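
As a minimal sketch of the kind of dispatch I have in mind (the type
names and tool commands here are invented for illustration; a real
implementation would live inside CVS itself):

   # Map the data type recorded in an RCS file's admin section to the
   # external commands that know how to diff and merge that type of
   # data.  All of the type names and commands are hypothetical.
   TOOLS = {
       "text":   {"diff": ["diff"],     "merge": ["diff3", "-E"]},
       "xfig":   {"diff": ["figdiff"],  "merge": ["figmerge"]},
       "msword": {"diff": ["worddiff"], "merge": ["wordmerge"]},
   }

   def select_tool(datatype, operation):
       """Return the command for `operation` ("diff" or "merge") on
       data of type `datatype`, or fail loudly if none is known."""
       try:
           return TOOLS[datatype][operation]
       except KeyError:
           raise ValueError("no %s tool registered for data type %r"
                            % (operation, datatype))

Because the data type is recorded once per RCS file, every pair of
revisions handed to the selected tool is guaranteed to be of the same
type.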

But there's a price, and it is that no position in the filesystem can
ever be reused for a type of data different from what has been stored
there before.  That means, e.g., that README files can't change formats
among plain text, rich text, MS Word, or whatever.  This is clearly an
unacceptable condition as long as software developers wish to evolve
their designs.

It is possible to meet both requirements:  guarantee that every revision
in an RCS file contains the same data type, and allow users to make
arbitrary changes to their source trees.  The way to do that is to
change how CVS maps files in the user's sandbox to the RCS files in the
repository, so that at any given time the working file maps to the
correct RCS file, but the correct RCS file may be different at
different points in the project's lifetime.  That same change also
enables other things, like the ability to genuinely rename a file.

Therefore, to accommodate multiple data types, it is in fact a _necessary_
requirement for CVS to track the file structure in addition to the content
of each file.
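
One possible shape for such a mapping, purely for illustration: a
versioned per-directory manifest that CVS consults instead of deriving
the RCS file name from the working file name.

   manifest for src/ at time T1:
       README -> README-0001,v     (plain text)
       foo.c  -> foo.c-0001,v      (C source)

   manifest for src/ at time T2, after README was recreated in MS Word:
       README -> README-0002,v     (application/msword)
       foo.c  -> foo.c-0001,v      (C source)

The working file README maps to exactly one RCS file at any given time,
each RCS file holds exactly one type of data, and a rename becomes a
simple edit to the manifest.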

>The main idea of change management is to capture and identify _changes_,
>not to record exact replicas of specific revisions across time.  The
>latter comes from the former, not the other way around.  Changes are
>best specified as the edit actions that were done to make them.  Why
>do you think it is that deltas are stored as edit scripts in both RCS
>and SCCS files?  I'll tell you for certain it wasn't just because there
>were already well known algorithms (and ready implemented tools) to
>create and make use of those edit scripts (though that was of course a
>big part of it).

First of all, the motivation for storing deltas was storage efficiency.
This goes all the way back at least to Brooks ("The Mythical
Man-Month").  Second, SCCS doesn't store edit scripts; it stores
interleaved deltas, which are more akin to #ifdef constructions.
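
For the curious, the body of an SCCS file is a "weave" that looks
schematically like this, where ^A stands for the SOH control byte and
the numbers are delta serial numbers:

   ^AI 1
   a line inserted by delta 1
   ^AD 2
   a line inserted by delta 1 and later deleted by delta 2
   ^AE 2
   ^AE 1

To extract a revision, SCCS makes a single pass over the weave, keeping
or discarding lines according to which deltas are in that revision's
ancestry, much the way cpp keeps or discards lines between #ifdef and
#endif.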

The deltas stored in RCS are not the same as the edit actions that the
user took.  They are approximations at best, and when examined in a
context that understands the semantics of the content, e.g. a C source
file, the deltas are gibberish.  Take for example the following files.

File aa1:
if (some condition) {
  action 1
  action 2
}



File aa2:
if (some condition) {
  action 1
  action 3
  action 2
}
if (another condition) {
  action 4
  action 2
}

Clearly, I have made two changes to create aa2 from aa1:  I added
action 3, and added the second condition.  And yet, here's what diff
says I did:

2a3,7
>   action 3
>   action 2
> }
> if (another condition) {
>   action 4

Now, if I were to try to do true change management, I would use
differences that are sensitive to the semantics of the content, which
would in fact represent much more closely what I actually did to these
files.  I would not use the differences produced by a tool that relies
on longest common subsequences.  But in the absence of tools that
understand what I really mean, I carry around sets of revisions (and
not just the results of my changes, but the before- and after-images),
and run my computations on the raw data, rather than relying on these
broken approximations.  In practice, I really carry around 4-tuples:
the identity of the history container, the location of the file in the
filesystem, the predecessor version, and the final version.  (The
reason for the two versions is that the change might be implemented
over multiple sequential commits to the same file.)  And given these
4-tuples, I can apply the proper context-sensitive tools to provide
_good_ change management.
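
A concrete (made-up) instance of such a 4-tuple:

   (/cvsroot/proj/src/parser.c,v, src/parser.c, 1.14, 1.17)

which says: within this history container, my change to src/parser.c
consists of everything between revision 1.14 (the image before I began)
and revision 1.17 (the image after my last commit), no matter how many
intermediate commits it took to get there.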

Because RCS is very good at reproducing the raw data that I need,
it fits my purposes very well.  And I don't need or want to look
at its innards to see how it represents my data.  It's useful if
I can store additional meta-data in the RCS files that my larger
process can understand, and I can do that while maintaining full
compatibility with RCS and all its features and warts and whatnot
just by adhering to the rcsfile specification.

>If you want to capture the essence of the changes made to binary files
>then you _must_ capture the actions used to effect those changes no
>matter what form those actions take -- that's the whole idea.  In
>computer science one of the most logical, and most widely used, ways to
>do that is to design a human/computer language that can specify the
>binary file's internal structure and content.  Once you have done that
>then it's trivial to use the text editing and comparison tools we
>already have to capture and record and document the changes in the
>binary file by capturing the edits made to the text file(s) containing
>the instructions in the language used to create the binary file.  And
>what do you know but that's _exactly_ what a programmer does when
>writing and modifying a program in a compiled language such as C (or
>even a CAD operator using a drawing tool such as xfig which is just a
>complicated GUI editor for a human/computer language used to define and
>describe drawings).  Note though that most of us (the sane ones amongst
>us :-) don't bother to record copies of revisions of the intermediate or
>product files created from our source code, nor the deltas between their
>revisions, because the changes between those revisions of binary files
>are meaningless (and besides we can reproduce those intermediate and
>product files on demand assuming we've also captured enough of the
>relevant information about our build process within the rest of our
>software configuration management environment).  Even the changes
>between revisions of some text files that are created from other text
>files are meaningless (e.g. PostScript generated from Troff or Lout; or
>"configure" scripts generated from "configure.m4" sources), and so
>storing those text-form intermediate files, especially in any scenario
>where they might ever have to be merged, is ludicrous and wasteful and
>pointless.

I agree with most of what you said in the above paragraph, except for
two things:  having a written human/computer language that specifies
the structure of a binary file is not nearly as common as you suggest
(nor is it likely to be common enough anytime soon), and the text-based
tools we already have won't necessarily work even with a text-based
representation of the data (at least not in a meaningful way).

First, there are many tools out there that have proprietary formats and
proprietary editors, provided by companies and individuals who do not
find themselves compelled to do what you think is the logical thing.
Some of us have to live with these tools, either through policy or
because they're the only tools available to accomplish the tasks we
need to accomplish.

Second, tools such as diff were designed for free-form text, not for
structured data.  As such, they will not produce meaningful results
when applied to structured data, as I've demonstrated above.
Therefore, dedicated diff and merge tools are needed for such data
anyway, regardless of whether or not it can be represented as pure
text.  In any case, structures like the popular object-oriented
graphics formats are better presented as pictures than as text, and
users will be more productive with dedicated picture-based merge tools
than with repeated rounds of editors and renderers to perform merges of
their images.  Even if you were to write Logo programs to draw your
graphics, a dedicated merge tool that understands the syntax of Logo is
a much better tool to use than diff3.  The same concept applies to all
structured data, which is why syntax-directed editors are becoming so
popular.
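
To tie this back to the aa1/aa2 example above: a hypothetical
syntax-aware diff for C-like sources would report something far closer
to what I actually did, along the lines of

   inserted statement "action 3" into the body of "if (some condition)"
   inserted block "if (another condition) { action 4; action 2 }"

rather than the line-oriented 2a3,7 hunk shown earlier.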

>I.e. the mere idea of wanting to store a binary file in a change control
>system, especially a generic one like CVS, and most especially one
>that's designed explicitly and specifically for the primary goal of
>supporting concurrent editing, is very wrong and completely nonsensical
>from the get go.  If all you have are binary files then _any_ other and
>_all_ other version management tools are better than CVS (and RCS and
>SCCS and anything else using diff, diff3, & patch).  One sure as hell
>doesn't need to use RCS files to store revisions of binary files if the
>deltas between those revisions are meaningless to most any human when
>they're presented to that human in the format they're stored in (i.e. in
>the RCS internal form).  If anyone's going to go to the trouble of
>trying to record and document revision history of any set of files then
>they'd be outright stupid to use an inappropriate tool to do so,
>especially if their sole reason for using the wrong tool was simply that
>it was what they had at hand or what they happened to already know.

The actual requirement of the concurrent editing paradigm is that a
viable merge tool exist for the type of data being edited.  The storage
engine of the versions doesn't matter, and neither does the internal
structure of the data, as long as the proper merge tools are selected
at the proper times and the prerequisites of the merge tools can be met.
The prerequisites are easily met by reproducing the original versions,
which the likes of RCS and SCCS do very well.  And selecting the proper
tool is easily done, provided certain guarantees are made.
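
Schematically, the storage engine's whole job is to hand a type-aware
merge tool its three inputs (the merge tool's name and options here are
invented):

   co -p1.4 foo.doc,v > base.doc        # common ancestor
   co -p1.7 foo.doc,v > theirs.doc      # their head revision
   docmerge --base base.doc --theirs theirs.doc \
            --mine foo.doc --out foo.doc

RCS reproduces the ancestor and head revisions byte for byte; the
type-specific tool does the actual merging.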

>Now if you really wanted to make some progress in computer science and
>software engineering technology then you'd think about designing and
>implementing tools that could identify and capture a more expressive
>form of edits by comparing to copies of a text file (*), instead of just
>continuing to blow your horn about storing in a change control system
>such as CVS what should always be considered to be intermediate and/or
>product files.

Well, CVS is not a change control tool; it's a version control tool.
There's a difference, though change control can't exist without version
control.

>(*) e.g. how about inventing an extension to existing diff/merge
>algorithms that could spot identifier and word substitutions
>(e.g. variable renames, etc.) just by textual comparison of a revised
>file with its ancestor.  If you could compress a variable rename where,
>say, 25% of the lines of a file are changed as a result down into one
>single edit command then you'd do wonders for conflict avoidance in
>merges where such edits would certainly conflict with other variable
>renames done on other branches, not to mention structural changes, and
>so on.

This is the wrong direction to go.  To find variable renames, you need
to know the language you're writing in.  Variable X in procedure A is
different from variable X in procedure B, and renaming X may be correct
in one or the other, and only rarely in both.

If a tool were to create a parse tree from the file, then apply
transformations like "rename variable X in this procedure," where the
procedure is encompassed by a subtree and the declaration can be found
within it, then you'd have something.  But this simply can't be done
with tools based on a longest common subsequence differencing
algorithm; you need a hierarchical differencing algorithm instead.
Such algorithms exist, but they are not yet in widespread use.
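
As a toy sketch of the kind of scoped transformation I mean (every type
and name here is hypothetical, not any real tool's API):

   from dataclasses import dataclass, field

   @dataclass
   class Node:
       kind: str                  # e.g. "procedure", "ident", "block"
       name: str = ""
       children: list = field(default_factory=list)

   def rename_in_procedure(tree, proc, old, new):
       """Rename identifier `old` to `new`, but only within the
       subtree of the procedure that declares it."""
       for node in tree.children:
           if node.kind == "procedure" and node.name == proc:
               _rename(node, old, new)   # scope limited to one subtree

   def _rename(node, old, new):
       if node.kind == "ident" and node.name == old:
           node.name = new
       for child in node.children:
           _rename(child, old, new)

A hierarchical differencing algorithm could then record the whole
change as the single operation rename(proc, X, Y) instead of dozens of
textual line edits, which is exactly what would make such changes
mergeable across branches.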

Also, if a global data structure is renamed, then the tool should
apply that change across the entire project, too.  But that raises many
other problems...

>--- End of forwarded message from address@hidden