Re: findin sloc changes between two tags

From: Paul Sander
Subject: Re: findin sloc changes between two tags
Date: Tue, 19 Feb 2008 11:01:00 -0800

On Feb 19, 2008, at 12:35 AM, yeti wrote:

On Feb 19, 1:06 pm, Paul Sander <address@hidden> wrote:
On Feb 18, 2008, at 8:40 PM, yeti wrote:

On Feb 19, 4:38 am, Paul Sander <address@hidden> wrote:
For this particular metric, I usually run the two versions through a
beautifier with standard settings, then diff the output of that.

On Feb 18, 2008, at 10:17 AM, Rick Genter wrote:

From: address@hidden
On Behalf Of Ted Stern

But that regexp handles only C++ comments.  I don't know of a way to
recognize /* ... [text containing newlines] ... */.  Possibly some
diff utility has that option (xxdiff, tkdiff?).

You could write an awk or perl script to filter the multiline comments
out, save the output to a file, then diff those files. I, however,
consider comments to be equally (or even more) important to
non-comments in source code, and don't understand the use case.

Hi guys,

Thanks for all the answers. I thought, however, that this would be a
fairly common problem with a standard solution. Keeping your
suggestions in mind, I did:

cvs diff -wlcbBC20 -r rev1 -r rev2 my_file.c |
    perl -0777 -pe 's{/\*.*?\*/}{}gs' | diffstat >> FileToHoldInfo.txt

The idea is to get enough context lines, then eliminate the comments
from the diff output, and finally use diffstat to gather stats. Do you
think this is the correct way?

I think that this method will work only if the comments are
completely enclosed within the context displayed by the diff
program.  It will fail (i.e., produce incorrect output), for example,
if a short sentence is added to the end of a 50-line comment.  Or to
the beginning of one.  Or to the middle of a 100-line comment.  It
also fails if someone arbitrarily inserts or removes newlines in the
code itself.
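
To make the first failure mode concrete, here is a minimal sketch in Python. The regular expression is the same non-greedy pattern as the Perl one-liner; the diff fragment is a made-up example (not taken from any real file) of a hunk that begins in the middle of a long comment.

```python
import re

# Same pattern as the Perl one-liner: non-greedy /* ... */ across newlines.
COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

# Hypothetical diff fragment: the hunk starts in the middle of a long
# comment, so the opening "/*" lies outside the displayed context.
hunk = (
    "   * ...tail of a long comment that began before the hunk.\n"
    "+  * A sentence added to the end of the comment.\n"
    "   */\n"
    "  int x = 0;  /* short comment */\n"
)

filtered = COMMENT.sub("", hunk)
print(filtered)
```

The short, fully enclosed comment is removed, but the added comment line survives with its "+" marker because the filter never saw the opening delimiter, so a tool such as diffstat would still count it as an added line.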

This is where beautifiers such as the "indent" program come in.  It
normalizes the format of the source code based on the syntax of the
programming language and policies specified on its command line.  It
leaves comments in place, so additional filtering (like your Perl one-
liner above) might be necessary.
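
One way to do that filtering on whole files, rather than on diff output, is sketched below in Python. The function name strip_c_comments is my own, and this is only a rough approximation of C lexing: it skips string and character literals so that a "/*" inside a string is left alone, but it does not handle trigraphs or backslash-continued "//" comments.

```python
import re

# Match string literals, character literals, or comments. Literals are
# listed first so that comment markers inside them are not treated as
# comments.
TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'     # string literal
    r"|'(?:\\.|[^'\\])*'"    # character literal
    r"|/\*.*?\*/"            # block comment, possibly spanning lines
    r"|//[^\n]*",            # line comment, up to but excluding the newline
    re.DOTALL,
)


def strip_c_comments(source: str) -> str:
    """Return source with C and C++ style comments removed."""
    def replace(match: re.Match) -> str:
        text = match.group(0)
        if text.startswith("/*"):
            return " "   # a block comment acts as a single space
        if text.startswith("//"):
            return ""    # drop the line comment, keep the newline after it
        return text      # string or character literal: leave untouched
    return TOKEN.sub(replace, source)


if __name__ == "__main__":
    sample = 'int a = 1; /* gone\n still gone */ char *s = "/* kept */"; // bye\nint b;\n'
    print(strip_c_comments(sample))
```

Because the whole file is read at once, the length of a comment no longer matters, which avoids the dependence on the number of context lines.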

After the two versions have been reduced to standard formats, you can
apply the diff program with minimal arguments.  Its output can be
used to count the number of lines inserted, deleted, and changed.
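
As a rough illustration of the counting step, the Python sketch below uses the standard difflib module in place of the diff program. That substitution is an assumption on my part: difflib's matching algorithm is not the one used by Unix diff or CVS, so the numbers can differ from theirs. The function name count_line_changes and the rule for splitting a replaced block into "changed" plus surplus lines are also my own choices.

```python
import difflib


def count_line_changes(old_text: str, new_text: str) -> dict:
    """Count inserted, deleted, and changed lines between two texts."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    counts = {"inserted": 0, "deleted": 0, "changed": 0}
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_old, n_new = i2 - i1, j2 - j1
        if tag == "insert":
            counts["inserted"] += n_new
        elif tag == "delete":
            counts["deleted"] += n_old
        elif tag == "replace":
            # Pair old lines with new lines; any surplus on one side is
            # counted as pure insertion or deletion.
            counts["changed"] += min(n_old, n_new)
            if n_new > n_old:
                counts["inserted"] += n_new - n_old
            else:
                counts["deleted"] += n_old - n_new
    return counts


if __name__ == "__main__":
    before = "a\nb\nc\n"
    after = "a\nB\nc\nd\n"
    print(count_line_changes(before, after))
```

Both inputs would be the beautified, comment-stripped versions of the two revisions. How a replaced block is split between "changed" and "inserted" is a policy decision, and different tools make it differently.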

Yes, you are right. I'm assuming that most comments would be 20 lines
long, though one can just as well use -C50 to make it work for 50-line
comments, and so on. The regexp can be modified to remove blank lines.
But now I have detected another problem :-(

Your algorithm also won't handle cases where users arbitrarily reformat the code. In C, for example, the following styles are common:

1a. Insert newlines between terms in complex boolean expressions.
1b. Make expressions as wide as possible and insert newlines only to avoid wraparound issues within terms.

2a. Surround all curly braces with newlines so that they always appear alone on lines of code.
2b. Place open curly braces at the ends of lines, and combine open and close braces with "else" keywords on a single line.

Beautifiers cut through this cosmetic stuff, immunizing the metrics from arbitrary reformats. On the other hand, they don't handle certain cases where users insert or remove optional artifacts, like inserting braces where they are allowed but not required.

If I check out two different versions of the file and run unix diff
on them, the results are very different from those obtained using cvs
diff on the two revisions. cvs diff shows 256 modifications (!) in
the code when there are no modifications at all. There are about 700
additions (+), but cvs diff shows only 424 (+). I think cvs diff
is confusing some additions with modifications. However, unix diff on
the files gives correct results.
Why is cvs diff showing incorrect results? Is this a known problem?
If so, are there any workarounds?

There are a lot of differencing algorithms out there. Some of them minimize the number of edits between versions, others minimize the size of the edits. Additionally, CVS has access to the individual deltas between versions, and it may be combining them in ways you don't expect (rather than constructing the two selected versions and running diff on them).
