bug#6557: du sometimes miscounts directories, and files whose link count equals 1

From: Jim Meyering
Subject: bug#6557: du sometimes miscounts directories, and files whose link count equals 1
Date: Sat, 03 Jul 2010 10:18:18 +0200

Paul Eggert wrote:
> (I found this bug by code inspection while doing the du performance
> improvement reported in:
> http://lists.gnu.org/archive/html/bug-coreutils/2010-07/msg00014.html
> )
> Unless -l is given, du is not supposed to count the same file more
> than once.  It optimizes this test by not bothering to put a file into
> the hash table if its link count is 1, or if it is a directory.  But
> this optimization is not correct if -L is given (because the same
> link-count-1 file, or directory, can be seen via symbolic links) or if
> two or more arguments are given (because the same such file can be
> seen under multiple arguments).  The optimization should be suppressed
> if -L is given, or if multiple arguments are given.
> Here is a patch, with a couple of test cases for it.  This patch
> assumes the du performance fix, but I can prepare an independent
> patch if you like.

Actually, that patch applies just fine, as-is.
However, it induces this new "make check" test failure:

    FAIL: du/files0-from (exit: 1)

    du (GNU coreutils) 8.5.75-569b2
    Copyright (C) 2010 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later 
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.

    Written by Torbjorn Granlund, David MacKenzie, Paul Eggert,
    and Jim Meyering.
    files0-from: test 2: stdout mismatch, comparing 2.O (actual) and 2.1 
    *** 2.O Sat Jul  3 09:28:08 2010
    --- 2.1 Sat Jul  3 09:28:08 2010
    *** 1 ****
    --- 1,2 ----
      0     g
    + 0     g
    files0-from: test 2a: stdout mismatch, comparing 2a.O (actual) and 2a.1 
    *** 2a.O        Sat Jul  3 09:28:08 2010
    --- 2a.1        Sat Jul  3 09:28:08 2010
    *** 1 ****
    --- 1,2 ----
      0     g
    + 0     g

That's because with the unpatched "du", a command like this, with
a duplicate argument, prints two lines, while the patched version
prints only one:

    $ seq 100 > g; du g g
    4       g
    4       g

    $ seq 100 > g; ./du g g
    4       g

Note that the vendor versions of "du" from at least Solaris 10,
OpenBSD, NetBSD and FreeBSD print both lines.
I prefer the new semantics, especially when using --total:

    $ seq 100 > g; du --total g g
    4       g
    4       g
    8       total

    $ seq 100 > g; ./du --total g g
    4       g
    4       total

You can get some of the old semantics by using -l:

    $ seq 100 > g; ./du -l --total g g
    4       g
    4       g
    8       total

What do you think of breaking with that tradition?  POSIX does appear
to say that for each "FILE" argument du must print a line, but it also
mentions how with linked files, the space must be counted only once.
You can definitely consider listing the same file twice as being
analogous to a file being hard-linked.
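
To illustrate that analogy, here is a rough Python sketch. It is not du's
actual C code, and the helper name du_sketch and its count_links parameter
are made up for this example. It assumes regular-file arguments only, sums
apparent sizes rather than disk blocks, and identifies a file by its
(device, inode) pair, so a repeated argument and a hard link are skipped
the same way:

```python
import os
import tempfile


def du_sketch(paths, count_links=False):
    """Return (lines, total); lines is a list of (size, path) pairs.

    Hypothetical helper, for illustration only.  A file is identified by
    its (st_dev, st_ino) pair, so a duplicate argument looks exactly like
    a second hard link to an already-counted file.
    """
    seen = set()      # (st_dev, st_ino) pairs that were already counted
    lines = []
    total = 0
    for path in paths:
        st = os.lstat(path)
        key = (st.st_dev, st.st_ino)
        if not count_links and key in seen:
            continue  # same file seen before: no line, no size added
        seen.add(key)
        size = st.st_size  # assumption: apparent size, not block usage
        lines.append((size, path))
        total += size
    return lines, total


with tempfile.TemporaryDirectory() as d:
    g = os.path.join(d, "g")
    with open(g, "w") as fh:
        fh.write("".join(f"{i}\n" for i in range(1, 101)))  # like: seq 100 > g
    print(du_sketch([g, g]))                    # g is listed and counted once
    print(du_sketch([g, g], count_links=True))  # like -l: listed and counted twice
```

With count_links left false, the second "g" produces no line and adds
nothing to the total, which matches the patched output shown above.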

An alternative might be to do this,

    $ seq 100 > g; du --total g g
    4       g
    0       g
    4       total

but this is too prone to misinterpretation both by people and by code
that parses du output.  So I'm inclined to go with your approach.

This is the additional patch we'd need to make the failing
test accept your new output.  You're welcome to merge
it into yours.

diff --git a/tests/du/files0-from b/tests/du/files0-from
index 620246d..860fc6a 100755
--- a/tests/du/files0-from
+++ b/tests/du/files0-from
@@ -70,15 +70,15 @@ my @Tests =
     {IN=>{f=>"g\0"}}, {AUX=>{g=>''}},
     {OUT=>"0\tg\n"}, {OUT_SUBST=>'s/^\d+/0/'} ],

-   # two file names, no final NUL
+   # two identical file names, no final NUL
    ['2', '--files0-from=-', '<',
     {IN=>{f=>"g\0g"}}, {AUX=>{g=>''}},
-    {OUT=>"0\tg\n0\tg\n"}, {OUT_SUBST=>'s/^\d+/0/'} ],
+    {OUT=>"0\tg\n"}, {OUT_SUBST=>'s/^\d+/0/'} ],

-   # two file names, with final NUL
+   # two identical file names, with final NUL
    ['2a', '--files0-from=-', '<',
     {IN=>{f=>"g\0g\0"}}, {AUX=>{g=>''}},
-    {OUT=>"0\tg\n0\tg\n"}, {OUT_SUBST=>'s/^\d+/0/'} ],
+    {OUT=>"0\tg\n"}, {OUT_SUBST=>'s/^\d+/0/'} ],

    # Ensure that $prog processes FILEs following a zero-length name.
    ['zero-len', '--files0-from=-', '<',

