[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[bug-diffutils] New diff option to ignore equivalent strings
[bug-diffutils] New diff option to ignore equivalent strings
Mon, 27 Sep 2010 11:49:43 +0100
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:188.8.131.52) Gecko/20100915 Thunderbird/3.1.4
Here's a description and source of a new diff option to ignore
-z ERE --ignore-equivalent-strings=ERE
Treat as equivalent any strings that match an extended regular
expression (ERE), at the same position in both files. (The examples
below should make this a bit clearer). It operates during the
differencing algorithm itself (unlike -I which post-processes the 'diff'
output). The option can be used multiple times (although there are some
caveats explained below), and ERE matching respects the case set by -i.
Note that -z does not directly ignore strings, it is merely treating
them as equivalent. However, it is possible to genuinely ignore strings
if the ERE can match an empty string (e.g. -z 'str|' or -z ' *'). Also
note that the ERE is treated symmetrically between both files (it's not
possible to treat as equivalent some string only in file 1 and some
other string only in file 2). Here are some examples:
-z abc !! Does nothing - don't use it like this !!
This will not affect the 'diff' output! Any string "abc" in file 1 would
be considered equivalent to any string "abc" in file 2. Of course, this
is what would happen anyway, had -z not been used. There is no point in
giving -z a plain string like this.
-z 'unsigned char|char'
The strings "unsigned char" and "char" are considered to be equivalent
to each other. Anywhere "char" appears in a file, if there is a
corresponding string "unsigned char" in the other file, then this
difference is ignored. Of course, "char" in one file still matches
"char" in the other, and the same is true for "unsigned char".
-z 'unsigned char|signed char|char'
The strings "unsigned char", "signed char" and "char" are all considered
to be equivalent to each other.
-z 'const |'
This genuinely ignores all strings "const ". What it does is to consider
any strings "const " to be equivalent to the empty string "". As far as
the differencing is concerned it's as if the string "const " doesn't
exist in the files - the strings will be output of course if the lines
differ for any other reason.
All contiguous runs of whitespace are treated as equivalent. This
includes runs of length 0. This means that all whitespace is effectively
genuinely ignored. It is equivalent to -w.
-z '[[:space:]]*$' -z '[[:space:]]\+'
All contiguous runs of whitespace at the end of a line are ignored, and
all non-zero contiguous runs of whitespace anywhere in a line are
ignored. It is equivalent to -b.
Treat (a wide variety of) numbers as equivalent to each other e.g.
3.14E0 27.18E1 789 are all the same.
-z '[0-3]\+[0-9] (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
Treat a particular date format as equivalent to each other.
Here are some more details of the algorithm. It has to take the hashing
of each line into account. In this respect it is similar to -E, -b and
-w which treat sequences of whitespace as equivalent by making
appropriate sequences of whitespace contribute identically to the hash
value of each line. Similarly with -z, strings in the text files that
match the same ERE contribute identically to the hash value. Subsequent
to this, 'diff' does a direct comparison of lines from both files which
it judges to be identical (based on the hash) - that logic also has to
take equivalent strings into account.
The matching algorithm is: strings are searched for along each line by
determining the first ERE (in the order they are specified on the
command line) with the next matching starting column. The search then
continues from the column after the end of the string just matched.
It was my original intention to match against a different ERE for each
file. Unfortunately this is incompatible with the way diff is coded - it
breaks the concept of equivalent classes. Here's an example, matching
against strings rather than EREs for simplicity. Lets assume that the
notation "-z a=b" means string "a" in file1 is equivalent to string "b"
in file2, and that file1 and file2 consist only of lines containing a
single "a" or "b". Now, a line "a" in file1 goes into equivalence class
1. Then a line "b" also from file1 has to go into a different class,
call it equivalence class 2. But which equivalence class should a line
"b" from file2 go into? It should really go into both since implicitly
"b=b", but by definition "a=b".
Mathematically, for an equivalence relation (~) to hold we must have the
Reflexivity: a ~ a
Symmetry: if a ~ b then b ~ a
Transitivity: if a ~ b and b ~ c then a ~ c.
It seems that the second two properties would no longer hold for
something like "-z a=b".
Personally, I find this option extremely useful, as it can remove large
numbers of irrelevant differences, but the user does need to take some
care when specifying multiple EREs. In these cases, if strings matched
by an ERE overlap with strings matched by another ERE (or with any
whitespace characters matched by -E, -b or -w), then the equivalence
assumptions will be broken. If matches could overlap with the match
starting in the same column, then the ERE with the longest match should
normally be placed first. In practise the user should make the EREs as
specific as possible. Some pitfalls of multiple EREs are:
a) -z 'a|b' -z 'a|c'. An "a" in one file would never be equivalent to a
"c" in the other file (-z 'a|b|c' could be used instead - although it's
not quite the same because "b" and "c" are now equivalent to each
other). What happens in the algorithm is that when a character 'a'
occurs in a file, it is hashed similarly to a character 'b' (since this
is the first ERE on the command line). The ERE 'a|c' doesn't get a look
in since ERE 'a|b' takes priority. Even if the first option did not take
priority, we would still have problems since we would loose the
transitivity relation, since b ~ c is not enforced.
b) -z 'abc|uvw' -z 'abcdef|uvwxyz'. "abcdef" and "uvwxyz" would never be
equivalent, since "abcdef" is matched by "abc" first. Switching the ERE
order would work better, because "abc" could not be matched by "abcdef"
(although the strings "abcdef" and "uvwdef" would then be thought
c) -z '[^c]b' -z 'a[be]'. "ab and "ae" would never be equivalent, since
"ab" is matched by "[^c]b" first.
All these pitfalls have a root cause of multiple EREs being specified
and the strings matching them potentially overlapping. I'm not sure of
the best way to handle this. Disallowing multiple -z's seems a bit
excessive. A way to detect possible overlap of two ERE's matching some
(unspecified) string would be useful, so that a warning could be issued,
but I'm not sure if such an algorithm exists.
Some of the this update is removal of redundant code. In function
find_and_hash_each_line() in io.c IGNORE_TAB_EXPANSION attempts to
account for occurrences of '\b' and '\r'. This coding is redundant
because function lines_differ() in util.c has nothing equivalent. I've
removed the redundant code. Presumably the intention was to ignore
changes that would not be apparent if the files are 'cat'ted, or
something similar - I'm not sure that there's an efficient and general
way to do this.
Description: Text document
|[Prev in Thread]
||[Next in Thread]|
- [bug-diffutils] New diff option to ignore equivalent strings,
Duncan Moore <=