From: Eric Blake
Subject: bug#10281: [1003.1(2008)/Issue 7 0000527]: du and files found via multiple command line arguments
Date: Thu, 12 Jan 2012 10:49:34 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:9.0) Gecko/20111222 Thunderbird/9.0

This topic came up again on the Austin Group call today, with no good
resolution yet.

On 12/18/2011 03:03 PM, Paul Eggert wrote:
> Eric Blake's Option 1 does not appear to be tenable, as du
> traditionally preserved hashes of duplicate files across all
> of its operands.  7th Edition Unix 'du' did that, and (as
> Jilles Tjoelker pointed out) so do at least two current 'du'
> implementations, namely, FreeBSD and GNU.
> 
> The idea behind Eric's Option 2 is better, but its wording
> is unclear partly because of another issue Jilles raised:
> whether a file's disk space should be counted multiple times
> if the file occurs multiple times and its link count is 1.
> For example:
> 
>   mkdir d
>   cd d
>   cp /bin/sh w
>   cp w y
>   ln y ../y
>   ln -s w x
>   ln -s y z
>   du -aL
> 
> This analyzes a directory with two regular files, 'w' and
> 'y'.  GNU and Solaris du count these files once each, with
> an accurate sum of non-symlink disk usage under the current
> directory.  But w's link count is 1 so FreeBSD counts 'w'
> twice, thus overcounting disk usage.
> 
> The current POSIX wording does not say what to do for this
> example, but the intent is to avoid overcounting disk usage,
> and the GNU and Solaris behavior supports this intent better.
> (The 7th Edition Unix behavior agrees with FreeBSD, but this
> predates symbolic links so the behavior is now dubious.)
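For anyone who wants to reproduce the quoted example, here is a minimal sketch that rebuilds the same tree in a scratch directory and compares the number of names with the number of distinct files behind them once symlinks are followed. The variable names (scratch, names, files) and the use of mktemp -d are my own assumptions, not part of the original example, and the script does not run du itself.

```shell
#!/bin/sh
# Rebuild the quoted example in a scratch directory (hypothetical location
# chosen by mktemp) so that nothing in the current directory is touched.
scratch=$(mktemp -d) || exit 1
(
    cd "$scratch" || exit 1
    mkdir d && cd d || exit 1
    cp /bin/sh w      # regular file, link count 1
    cp w y            # separate copy, so a different inode
    ln y ../y         # y now has link count 2
    ln -s w x         # second name for w, via a symlink
    ln -s y z         # second name for y, via a symlink
)
# ls -Li follows symlinks and prints "inode name", one entry per line.
names=$(ls -Li "$scratch/d" | wc -l | tr -d ' ')
files=$(ls -Li "$scratch/d" | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
echo "names: $names  distinct files: $files"
rm -rf "$scratch"
```

With -L, 'w' is reachable under two names even though its link count is 1, which is exactly the case where a link-count test alone does not detect the duplicate.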

One of the points made is that the standard currently requires elision
only for files with a link count > 1.  Here is an interesting example
with FreeBSD du:

$ echo > a
$ du -a a a
2       a
2       a
$ ln a b
$ du -a a a
2       a
$

That is, the second argument is elided when the inode for 'a' is found
in the hash, which means the hash is preserved across arguments; but the
inode for 'a' is only put in the hash if its link count is > 1.
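
The bookkeeping described above can be modeled in a few lines of shell. This is only a sketch of the observable behavior, not how FreeBSD or any other du is implemented. The function name count_listed and the policy names "multi" and "all" are made up for illustration, and the model assumes all operands live on one file system, since it compares inode numbers without device numbers.

```shell
#!/bin/sh
# count_listed POLICY FILE...
# Print how many of the FILE operands would be listed under POLICY:
#   multi  remember an inode only when its link count is > 1
#   all    remember every inode, whatever its link count
count_listed() {
    policy=$1; shift
    seen=" "          # space-delimited list of inode numbers already counted
    listed=0
    for f in "$@"; do
        # ls -lid prints: inode mode nlink owner ...
        fields=$(ls -lid -- "$f") || return 1
        inode=$(printf '%s\n' "$fields" | awk '{print $1}')
        nlink=$(printf '%s\n' "$fields" | awk '{print $3}')
        case $seen in
            *" $inode "*) continue ;;    # already counted: elide this operand
        esac
        listed=$((listed + 1))
        if [ "$policy" = all ] || [ "$nlink" -gt 1 ]; then
            seen="$seen$inode "
        fi
    done
    echo "$listed"
}

# Replay the session above in a scratch directory.
(
    tmp=$(mktemp -d) && cd "$tmp" || exit 1
    echo > a
    count_listed multi a a    # prints 2: link count is 1, nothing remembered
    ln a b
    count_listed multi a a    # prints 1: link count is 2, second 'a' elided
    cd / && rm -rf "$tmp"
)
```

Under the "all" policy both calls would print 1, which is the behavior Option 2A below would require.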

> 
> Given all the above, the standard's wording could be
> improved in several different ways, all elaborations of
> Option 2.  Here are two possibilities:
> 
>   Option 2A - require that files be hashed among all
>   operands, and that disk usage be counted at most once.
> 
>     Change line 84170 [du DESCRIPTION] from:
> 
>       Files with multiple links shall be counted and written
>       for only one entry.
> 
>     to:
> 
>       A file that occurs multiple times shall be counted and
>       written for only one entry, even if the occurrences
>       are under different file operands.
> 
>   Option 2B - leave unspecified whether files are hashed
>   among all operands, and leave unspecified whether disk
>   usage is counted multiple times for files whose link
>   count does not exceed 1.  From the user's point of view,
>   this means du's output is a reliable count of disk usage
>   only if du is invoked without -L and with -x and with at
>   most one operand.
> 
>     Change line 84170 [du DESCRIPTION] from:
> 
>       Files with multiple links shall be counted and written
>       for only one entry.
> 
>     to:
> 
>       A file that occurs multiple times under one file
>       operand and that has a link count greater than 1 shall
>       be counted and written for only one entry.  It is
>       implementation-defined whether a file that has a link
>       count no greater than 1 is counted and written just
>       once, or is counted and written for each occurrence.
>       It is implementation-defined whether a file that
>       occurs under one file operand is counted for other
>       file operands.
> 
> Option 2A is simpler and clearer, but it invalidates many
> existing implementations.  Option 2B modifies the standard
> to describe how existing implementations actually work, but
> is more complicated and more of a hassle to use reliably.
> 
> Eric raised one other issue: the description of the -a
> option implies that "du A B" must always list B.  This
> implication is incorrect for 7th edition Unix du, GNU du,
> and (I expect) FreeBSD du, so it should be fixed as well.
> Here's one possible fix, which is independent of the
> abovementioned changes.
> 
>   Change line ????? [du OPTIONS] from:
> 
>     Regardless of the presence of the -a option,
>     non-directories given as file operands shall always
>     be listed.
> 
>   to:
> 
>     The -a option does not affect whether
>     non-directories given as file operands are listed.
> 
> (Sorry, I don't know the line number here; I don't have a
> PDF copy of the current standard and don't know offhand how
> to get one.)

It boils down to a decision: either we standardize a useful behavior
that avoids over-counting but possibly invalidates existing
implementations (in which case it is better targeted to Issue 8), or we
give up and declare things unspecified when a file with a link count of
1 is encountered through multiple locations (in which case we could
make the changes in TC2 of Issue 7, and still make recommendations on
the underlying goal of avoiding over-counting).

It was also mentioned on today's call that cpio may have a similar
overcounting issue.

-- 
Eric Blake   address@hidden    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
