coreutils

From: Pádraig Brady
Subject: Re: stat: added features: `--files0-from=FILE', `--digest-type=WORD' and `--quoting-style=WORD'
Date: Thu, 22 May 2014 16:50:12 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 05/22/2014 03:05 PM, Stefan Vargyas wrote:
>> Date: Thu, 22 May 2014 12:28:22 +0100
>> From: Pádraig_Brady <address@hidden>
>> Subject: Re: stat: added features: `--files0-from=FILE', `--digest-type=WORD'
> 
>>    join -j2 <(stat -c '%s  %n' /bin/ls /bin/cp | sort -k2) <(sha1sum /bin/cp /bin/ls | sort -k2)
> 
>>    tr '\n' '\1' |
>>    sort |
>>    uniq -u ...
> 
> Your remarks are correct iff stat and sha1sum output *are* able to produce
> consistently joinable outputs. However when attempting to employ such usage
> patterns into *generally usable scripts*, one has to take care of possible
> inconsistencies (leading to bugs!) occurring when file names contain SPACE,
> TAB, NL and other such chars.

> A solution would be to impose TAB only as field separator -- thus ensuring
> that it cannot appear anywhere else. Then one might invoke join with
> "-t $'\t'". With this condition, it should be clearer why the
> '--quoting-style=escape' and '--digest-type=sha1' options and the '%S'
> format specifier are needed for stat.
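To make the TAB-only idea concrete, here is a minimal sketch using only existing tools. It assumes GNU stat, sha1sum, sort and join, and file names that contain no TAB, newline or backslash (sha1sum escapes the latter two). The temporary directories and sample file names are purely illustrative.

```shell
tab=$(printf '\t')
data=$(mktemp -d)   # hypothetical files to examine
work=$(mktemp -d)   # scratch space for the sorted lists
printf 'hello\n' > "$data/with space"
printf 'hi\n'    > "$data/plain"

# size<TAB>name, sorted on the name field
(cd "$data" && stat --printf '%s\t%n\n' -- *) |
  LC_ALL=C sort -t "$tab" -k 2,2 > "$work/sizes"

# sha1sum prints "digest  name"; turn the first two-space gap into one TAB
(cd "$data" && sha1sum -- *) | sed "s/  /$tab/" |
  LC_ALL=C sort -t "$tab" -k 2,2 > "$work/sums"

# Join on field 2 (the name): name<TAB>size<TAB>digest
joined=$(LC_ALL=C join -t "$tab" -j 2 "$work/sizes" "$work/sums")
printf '%s\n' "$joined"

rm -rf "$data" "$work"
```

A name containing a space joins cleanly here because TAB is the only separator; a name containing a TAB would still break the field layout, which is the gap the proposed escaping is meant to close.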

This boils down to having a standard, unambiguous escaped file name
format that is consistent between tools, and thus can be joined on.
I.e., any of these could take a --quote-name option: du ls *sum stat wc.
Also realpath could be used generally to convert to/from this quoting.

Now you might do all but the checksum within `find`, though
then you'd have to have consistency between the names of tools
from separate projects, which could be awkward as shown below:

$ touch "ta     b"

$ stat --printf '%f\t%a\t%u\t%g\t%h\t%i\t%s\t%W\t%X\t%Y\t%Z\t%N\n' "ta  b"
81b4    664     500     500     1       2386095 0       0       1400771654      1400771654      1400771654      `ta\tb'
$ find "ta      b" -printf '%%f\t%m\t%U\t%G\t%n\t%i\t%s\t%%W\t%A@\t%T@\t%C@\t%p\n'
%f      664     500     500     1       2386095 0       %W      1400771654.6712056900   1400771654.6712056900   1400771654.6712056900   ta?b

$ ls --quote-name "ta   b"
"ta\tb"
$ ls  "ta       b"
ta?b

Alternatively, could one avoid joining on names entirely and use inode numbers?
Some file systems generate inode numbers on the fly, but the inode number
would usually be a more general attribute to key on.

>> There is no advantage of supporting this option in stat
>> as that is only useful when a command needs to process all
>> file names in a _single invocation_, like when sorting or accumulating etc.
>> For stat one can efficiently:
>>
>>    find ... -print0 | xargs -r0 stat ...
>>
>> or
>>
>>    find ... -exec stat {} +
> 
> One meaningful reason for single invocation is efficiency. The input to stat
> can be huge (and in my initially evoked scenario in fact often is!) -- and
> that possibly large amount of data propagates down the multiple pipelines
> and fifos of your scenario above.

Note the above runs only one stat process per few thousand files.
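One way to see the batching, as a sketch: replace stat with a small shell that prints how many arguments it received, so each output line corresponds to one spawned process. The directory and the count of 200 files are arbitrary choices for illustration.

```shell
dir=$(mktemp -d)    # hypothetical directory with many files
i=1
while [ "$i" -le 200 ]; do
  : > "$dir/f$i"
  i=$((i + 1))
done

# Each shell spawned by xargs prints its argument count:
# one output line per process that stat would have needed.
batches=$(find "$dir" -type f -print0 | xargs -r0 sh -c 'echo "$#"' sh)
printf '%s\n' "$batches"

# The same measurement using find's own batching
batches_exec=$(find "$dir" -type f -exec sh -c 'echo "$#"' sh {} +)
printf '%s\n' "$batches_exec"

rm -rf "$dir"
```

With a couple of hundred short names, both variants normally report everything in a single batch; the batch size is bounded by the system's argument-length limits rather than by a fixed file count.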

>> Note also that sort has the --zero-terminated option, as do newer versions of
>> join and uniq.
> 
> The fanciful '-0|--null' options refer to both input and output of sort. The
> existing '-z|--zero-terminated' -- only to sort's output.

Input is handled too:

  $ printf '%s\000' 3 1 2 | sort -z | tr '\0' '|'
  1|2|3|

BTW I see how the existing sort man page can be confusing.
I think I'll update the --zero-terminated descriptions to:

  -  -z, --zero-terminated     end lines with 0 byte, not newline
  +  -z, --zero-terminated     items are delimited with NUL, not newline

>> This could be useful, however there is already the %N option for quoted file
>> name.
>>
>> $ stat -c %N /bin/ls
>> ‘/bin/ls’
>> $ LANG=C src/stat -c %N /bin/ls
>> '/bin/ls'
> 
> Recall the claimed consistency from above. In case of symlinks, %N produces
> output like the one below:
> 
>   $ touch /tmp/foo
>   $ ln -sv /tmp/foo /tmp/bar
>   `/tmp/bar' -> `/tmp/foo'
>   $ stat -c %N /tmp/bar
>   `/tmp/bar' -> `/tmp/foo'
>   $

Fair enough. But note that, disregarding the issue of joining on file names,
the output above would be useful to compare on, as the name and the
symlink target are compared directly. This is another argument
for using the inode rather than the name as the key to sort/compare on.

> Also, in case of symlinks, the digest sum computing programs do follow the
> links, i.e. they actually compute digests for the content of the file to which
> the symlink points:
> 
>   $ sha1sum /tmp/foo /tmp/bar
>   da39a3ee5e6b4b0d3255bfef95601890afd80709  /tmp/foo
>   da39a3ee5e6b4b0d3255bfef95601890afd80709  /tmp/bar
> 
> The semantics of %S in the proposed patches is different however: the new stat
> produces the digest of the *content* of the file itself. In case of symlinks
> that content is obtained via 'areadlink_with_size':
> 
>   $ stat2 -c '%S  %n' /tmp/foo /tmp/bar
>   da39a3ee5e6b4b0d3255bfef95601890afd80709  /tmp/foo
>   469150566bd728fc90b4adf6495202fd70ec3537  /tmp/bar

Ah right thanks. We could add -LP to the *sum utils to cater for this,
but the point in the previous paragraph about keying on inode number
would avoid this use case.
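For reference, the two semantics can be approximated with existing tools. This is only a sketch: it assumes the "content" of a symlink is its target path without a trailing newline, which is a guess at what the proposed %S hashes, and the paths are temporary ones created for the example.

```shell
dir=$(mktemp -d)
: > "$dir/foo"                 # empty regular file
ln -s "$dir/foo" "$dir/bar"    # symlink pointing at it

# sha1sum follows the link, so this digests the (empty) target file.
followed=$(sha1sum < "$dir/bar" | cut -d ' ' -f 1)

# Digest of the link itself: the target path as stored in the link,
# printed without a newline (assumed to match the %S behaviour).
own=$(readlink -n "$dir/bar" | sha1sum | cut -d ' ' -f 1)

printf 'followed: %s\nown:      %s\n' "$followed" "$own"
rm -rf "$dir"
```

The first digest is that of empty input, as in the sha1sum output quoted above; the second depends on the link's target string, so it changes whenever the link is repointed even if the file content stays the same.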

thanks,
Pádraig.



