bug-diffutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug-diffutils] New diff option to compare file/directory properties


From: Duncan Moore
Subject: Re: [bug-diffutils] New diff option to compare file/directory properties
Date: Thu, 14 Oct 2010 16:47:38 +0100
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4

Hello Andreas

Thanks for the feedback.

On 11/10/2010 18:11, Andreas Gruenbacher wrote:
I think we should extend the diff file format so that additional information
can be included.  Patches should still remain backwards compatible as much as 
possible, though.

I'm not sure we can have existing implementations of 'patch' working with the new 'diff --compare=...' output format, unless we have all the differing properties on a single line, so I think it may be best to forget about backwards compatibility in this sense. Of course, if the --compare=... option isn't used, then there is no change to the 'diff' output, and no backwards compatibility issues.

Firstly, I'll just state some of the points that the updates were intended to address, taken from the 'diff' info manual:

Section 18.1 - One should be able to use gnu 'diff' to generate a patch from any pair of directory trees, and given the patch and a copy of one such tree, use patch to generate a faithful copy of the other.

Section 18.1.2 Handling Changes to the Directory Structure.
'diff' and 'patch' do not handle some changes to directory structure. For example, suppose one directory tree contains a directory named ‘D’ with some subsidiary files, and another contains a file with the same name ‘D’. ‘diff -r’ does not output enough information for 'patch' to transform the directory subtree into the file.
There should be a way to tell 'patch' that a file’s time stamp has changed, even if its contents have not changed.
These problems can be fixed by extending the 'diff' output format to represent changes in directory structure, and extending 'patch' to understand these extensions.

Section 18.1.3 Files that are Neither Directories Nor Regular Files
Some files are neither directories nor regular files: they are unusual files like symbolic links, device special files, named pipes, and sockets. Currently, 'diff' treats symbolic links like regular files; it treats other special files like regular files if they are specified at the top level, but simply reports their presence when comparing directories. This means that 'patch' cannot represent changes to such files. For example, if you change which file a symbolic link points to, 'diff' outputs the difference between the two files, instead of the change to the symbolic link.
'diff' should optionally report changes to special files specially, and 'patch' should be extended to understand these extensions.

Additionally, 'diff' is not a tool for just creating input for 'patch', it's also for allowing the user to easily see what's different in file contents. The aim was to carry this principal over for other file properties, so that the user can selectively compare any file properties, either with or without the traditional file contents difference. This would be separate from it's use in creating patch files, although the output format would be the same.

We also need to define what diff files are supposed to be used for: for 
example, they currently contain file modification times.  Those timestamps are 
usually ignored when applying a diff (or left out in the first place).  It is 
not entirely clear which additional information to include by default (on the 
diff side), and which information to ignore by default (on the patch side).  I 
don't think file ownership should normally be stored in diff files, for 
example.

My thinking was that their is no 'default' set of extra properties that 'diff' outputs - the user has to specify them with keywords. With a traditional patch file from 'diff' (i.e. --compare has not been used), 'patch' would operate exactly as it does now. If --compare=all has been used to supply extra information, then 'patch' would default to using some subset (possibly null) of the available property information - that's open to discussion, and more of a 'patch' matter than a 'diff' one. And a 'patch' option like --use=time,mode etc could be used to explicitly specify some subset of available properties to use.
I have used the keyword 'all', to select all available properties. Maybe we need to supplement this with another keyword, say 'basic', that only selects basic properties likely to be of use in a patch file - e.g. 'basic' would not include ownership. And 'patch' would default to using the same 'basic' subset.

Probably the most widely used extended diff format today is that of git.  The 
"git diff" format always starts with a line of the form:

	"diff --git old_name new_name"

I'm not familiar with the git 'diff' format, but I've had a look at the link you've given. I'm not sure how desirable it is to be compatible with this format. Only the mode information seems to be of  direct relevance to 'diff'. Some of the changes are to do with files that have been renamed or copied. This is information that 'diff' couldn't be expected to generate, although maybe the 'diff' output format should be able to accommodate this information even though it couldn't generate it itself. Also, git diff uses 'old' and 'new' in the output, which are concepts that are not /directly/ relevant to 'diff' itself.

The features that the "git diff" format adds include:

 * Filename quoting using C string literals for filenames which require
   quoting.

Filename quoting is something I'm intending to include in 'diff' separately. Currently I'm using shell quoting style, so only filenames with unusual characters get quoted. It would be easy to change this to some other style already in gnulib, but I'll leave it to others to decide what the best style should be.

 * Distinguishes between empty files and non-existing files.

I'm not sure what format git diff uses. In my updates, -N --compare=all would do it. One file gets marked as "non-existent" (see example at end), and there would be no contents difference output (so it's implicit rather than explicit).

 * Support for symlinks.

Again, I'm not sure how git diff does this. See the example at the end for one way we could incorporate it with --compare=...

 * Support for binary files and binary deltas is included.  (The "payload" is
   base85 encoded in this case.)

I wasn't intending to incorporate this, and am not sure if 'diff' should. If it was incorporated in the future I think a separate option, --binary-content say, might be preferable to adding it to the --compare=... keyword list.

 * The SHA-1 hash of the file before and after a change is included.

Again, I wasn't intending to incorporate this, and am not sure if 'diff' should. A keyword 'hash' could be added to the --compare=.. list. It would lead to some 'quirks' though e.g., diff -b reporting identical contents, but the hashes being different.

--compare=[content,time,mode,size,owner,group,all,objects]
This option provides a number of file properties to compare, for normal 
and both context output styles. Keyword 'content' means the file 
contents, [...]
I don't understand.  There are two things to distinguish here: what diff looks 
at in order to decide if a file is considered "identical", and what 
information diff includes in its output.  Which refers to which?

For most of the examples I gave, if any of the specified file properties differ, then the whole set of specified file properties are printed for that file object (all on one line). I then gave an alternative output format, where only the selected properties that differ are printed (on separate lines). I now think that we should be going with the second format, or something like it.

Character and block special files don't make sense for me to include in a 
diff: they are not portable by design, and make little to no sense to apply in 
a different context (on a different machine or even after a driver load or 
removal on the same machine).

Some special files may not be useful for 'patch', but 'diff' is not a tool for just supplying input for 'patch'. As a tool for saying what's different between files, I think 'diff' should be reporting it if these types of files are present. 'patch' would be free to ignore any information that is deemed not relevant for it.

Directories are traditionally created and removed as needed by GNU patch.  I'm 
not sure if it makes sense to include them in diffs at all, but if we want to 
include them, doing so in a backwards compatible way (so that current 
implementations of patch will ignore directories) will not be easy.

Additional information for directories (and special files for that matter), is only output if the 'objects' keyword is used (e.g. --compare=all,objects). If 'objects' is used then, yes, existing implementations of 'patch' wouldn't cope with it, but they probably wouldn't cope with extra property information for regular files, unless it's all on one line. Also, if a file has been changed to a directory then it would be necessary for 'patch' to have directory information.

Any difference in content or file properties causes the specified 
property information to be written to a modified context style header. 
Note that the keyword 'time' is not written - this saves space, and is 
more compatible with the existing context header line format.

   % diff -u --compare=all aaa bbb
   --- aaa 2010-09-23 20:51:06.984375000 +0100 mode=-rw-r--r-- size=4 
owner=Duncan group=None
   +++ bbb 2010-09-23 20:51:03.015625000 +0100 mode=-rwxr--r-- size=4 
owner=Duncan group=None
   @@ -1 +1 @@
   -111
   +222
Sorry, but stuffing all this information in the "---" and "+++" header lines 
is just horrible.

The idea behind this was to make the new output (more likely to be) compatible with existing implementations of 'patch'. However, I agree that this is not the best way to display the differences, and one line per property is better. This means giving up on having current implementations of 'patch' being able to process the new output format.

Including information which did not actually change also doesn't make sense to 
me.

I agree with this too. This means that the example above would become:

  % diff -u --compare=all aaa bbb
  --- aaa
  +++ bbb
  --- time: 2010-09-23 20:51:06.984375000 +0100
  +++ time: 2010-09-23 20:51:03.015625000 +0100
  --- mode: rw-r--r--
  +++ mode: rwxr--r--
  @@ -1 +1 @@
  -111
  +222

Non-existent files would give this sort of output:

  diff -u -r -N '--compare=all,objects' 5aDN/file1 5bDN/file1
  --- 5aDN/file1 non-existent
  +++ 5bDN/file1
  +++ time: 2000-01-01 00:00:01.000000000 +0000
  +++ mode: -rw-r--r--
  +++ size: 2
  +++ owner: Duncan
  +++ group: None
  +++ object: regular file
  @@ -0,0 +1 @@
  +a

Symbolic links are normally dereferenced and treated like regular files, but a --no-dereference option could be used to treat symbolic links like the other special files:

  % diff -u --no-dereference 12u 12v
  File 12u/z is a regular file while file 12v/z is a symbolic link

  % diff -u --no-dereference '--compare=all,objects' 12u 12v
  diff -u --no-dereference '--compare=all,objects' 12u/z 12v/z
  --- 12u/z
  +++ 12v/z
  --- time: 1999-12-31 23:59:59.000000000 +0000
  +++ time: 2000-01-01 00:00:01.000000000 +0000
  --- mode: -rw-r--r--
  +++ mode: lrwxrwxrwx
  --- size: 3
  +++ size: 5
  --- object: regular file
  +++ object: symbolic link ->  ../12
  @@ -1 +0,0 @@
  -aa

In this last example, the symbolic link is treated as empty (like other special files) for the purposes of the contents differences.

Regards
Duncan


reply via email to

[Prev in Thread] Current Thread [Next in Thread]