Re: cp: behavior regression in 8.23


From: TAMUKI Shoichi
Subject: Re: cp: behavior regression in 8.23
Date: Sun, 01 Feb 2015 22:13:43 +0900

Hello Padraig and Bob,

From: Bob Proulx <address@hidden>
Subject: Re: cp: behavior regression in 8.23
Date: Sat, 31 Jan 2015 20:00:44 -0700

> Padraig Brady wrote:
> > TAMUKI Shoichi wrote:
> > > > What's the particular problem you have with the order
> > > > of the files in the tar archive, so I understand your issue completely?
> > > 
> > > The point is cp should keep the function to preserve the deterministic
> > > directory structure of the original files/directories in the copy.
> > > That was possible with the cp in coreutils-8.22 or earlier.
> > 
> > There is no such guarantee from the system though.
> > Depending on the file system, number of files and the structure
> > of the underlying tree etc. the order can change.
> > cp uses savedir() (similar to scandir), which calls readdir(),
> > and readdir() is not deterministic.

Well, readdir() behaves deterministically when it runs on ext2, or on
ext[34] without the dir_index feature.  It also behaves
deterministically on btrfs and xfs.  On the other hand, readdir() is
not deterministic on ext[34] with the dir_index feature.  So I agree
that there is no such guarantee from the system.

> At one time directories were purely files, special files but files
> just the same.  Entries were stored as a list.  Lookup was linear.  In
> those old days an accidental order appeared by the order files were
> listed as entries in the file.  The ordering depended upon the history
> of files created and deleted.
> 
> The list structure had some severe problems.  Any directory with a
> large number of files in them (where large is often a low small number
> such as 5,000) became very inefficient to work with due to the linear
> lookup nature of the lists.  Directories would really slow down.  For
> many years now newer file systems use B-trees for the on disk
> directory file structure.  The old linear lists are mostly a legacy
> that shouldn't be seen anymore in real usage.

That is exactly as you say.  However, not only older file systems such
as ext2 but also newer file systems such as btrfs use a linear search
for readdir().

Here is a tiny test script.  I ran it on ext4 without dir_index and on
btrfs.  On both file systems the two listings, 00 and 11, are
identical.

#!/bin/sh -x
wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.22.tar.xz
sudo tar xpJf coreutils-8.22.tar.xz
mkdir dup
sudo cp -a coreutils-8.22 dup   # <-- invoking cp from coreutils-8.22
tar cpJf coreutils-8.22-dup.tar.xz --numeric-owner -C dup coreutils-8.22
tar tvpJf coreutils-8.22.tar.xz > 00
tar tvpJf coreutils-8.22-dup.tar.xz > 11
diff -u 00 11

Thus readdir() on ext4 without dir_index and on btrfs behaves
deterministically.  Maybe readdir() on xfs behaves the same way.
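
A smaller variant of the same check, needing neither wget nor sudo,
might look like the sketch below.  It only compares the raw entry
order of a directory with that of its cp -a copy.  The function names
list_unsorted and order_preserved are hypothetical helpers of mine,
and the result naturally depends on the file system and on the cp
version in use.

```shell
#!/bin/sh
# Sketch only: list_unsorted and order_preserved are hypothetical names.

# Print the entries of directory $1 in the order readdir() returns them.
# "ls -f" does not sort and also lists "." and "..".
list_unsorted() {
    ls -f "$1"
}

# Copy directory $1 to $2 (which must not exist yet) with cp -a, then
# report whether the copy lists its entries in the same order.
order_preserved() {
    src=$1
    dst=$2
    cp -a "$src" "$dst" || return 2
    if [ "$(list_unsorted "$src")" = "$(list_unsorted "$dst")" ]; then
        echo "same order"
    else
        echo "order differs"
        return 1
    fi
}
```

For example, "order_preserved coreutils-8.22 dup-8.22" checks only the
top-level directory; a recursive comparison would need a loop over the
subdirectories.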

> > > This issue is reproducible on ext[234] w/o dir_index feature, btrfs,
> > > and xfs filesystems.
> 
> And without the dir_index feature you are using the old directory list
> behavior instead of the tree/index.  That is why you are seeing it.
> If you did use the dir_index then the internal order would always be
> sorted due to the tree structure.

That's right.  For that reason, people who want readdir() to behave
deterministically on ext[34] file systems make it a rule to disable
the dir_index feature.
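
As a rough illustration, whether dir_index is enabled can be read from
the "Filesystem features:" line that tune2fs -l prints.  The helper
name has_dir_index and the device name /dev/sdXN below are
placeholders of mine, not anything from this thread.

```shell
#!/bin/sh
# Sketch only: has_dir_index is a hypothetical helper name.

# Read "tune2fs -l" output on stdin and succeed if the feature list
# contains dir_index.
has_dir_index() {
    grep '^Filesystem features:' | grep -qw 'dir_index'
}

# Usage (needs root; /dev/sdXN is a placeholder device):
#   tune2fs -l /dev/sdXN | has_dir_index && echo "dir_index is enabled"
#
# tune2fs(8) documents "-O ^feature" for clearing a feature, e.g.
#   tune2fs -O ^dir_index /dev/sdXN
# Please check the manual page for any follow-up steps before trying it.
```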

> > I still don't understand why this is an issue TBH.
> > Directory listing programs like ls normally sort results.
> > If you want reproducible builds then tar has the --sort=name option.

There may be cases where we want tar to create an archive with neither
the --sort=name option nor the --sort=inode option.  For example, one
may want to package software in "make install" order (see below).  I
think that is why tar uses --sort=none as the default.

tamuki@wombat:~$ tar xpJf coreutils-8.22.tar.xz
tamuki@wombat:~$ cd coreutils-8.22
tamuki@wombat:~/coreutils-8.22$ ./configure
tamuki@wombat:~/coreutils-8.22$ make
tamuki@wombat:~/coreutils-8.22$ make install DESTDIR=/home/tamuki/work
tamuki@wombat:~/coreutils-8.22$ ls -fl --fu ~/work/usr/local/bin
total 16826368
drwxr-xr-x 2 tamuki users   4096 Sun Feb  1 19:23:46 2015 ./
drwxr-xr-x 5 tamuki users   4096 Sun Feb  1 19:23:47 2015 ../
-rwxr-xr-x 1 tamuki users 429286 Sun Feb  1 19:23:46 2015 install*
-rwxr-xr-x 1 tamuki users 109466 Sun Feb  1 19:23:46 2015 chroot*
-rwxr-xr-x 1 tamuki users  89725 Sun Feb  1 19:23:46 2015 hostid*
-rwxr-xr-x 1 tamuki users  97909 Sun Feb  1 19:23:46 2015 nice*
-rwxr-xr-x 1 tamuki users 170171 Sun Feb  1 19:23:46 2015 who*
-rwxr-xr-x 1 tamuki users  97804 Sun Feb  1 19:23:46 2015 users*
-rwxr-xr-x 1 tamuki users 118058 Sun Feb  1 19:23:46 2015 pinky*
-rwxr-xr-x 1 tamuki users 129745 Sun Feb  1 19:23:46 2015 uptime*
-rwxr-xr-x 1 tamuki users 197330 Sun Feb  1 19:23:46 2015 stty*
-rwxr-xr-x 1 tamuki users 349906 Sun Feb  1 19:23:46 2015 df*
-rwxr-xr-x 1 tamuki users 222445 Sun Feb  1 19:23:46 2015 stdbuf*
-rwxr-xr-x 1 tamuki users 121279 Sun Feb  1 19:23:46 2015 [*
-rwxr-xr-x 1 tamuki users 114934 Sun Feb  1 19:23:46 2015 base64*
-rwxr-xr-x 1 tamuki users  93290 Sun Feb  1 19:23:46 2015 basename*
        :
        :

> Usually I am arguing the side of the status quo and supporting legacy
> behavior.  But I don't think this is one of those features that should
> be preserved.  That previous directories were implemented as a list of
> entries feels to me like an internal implementation detail that should
> not be known on the outside.  Knowing that and making use of that
> detail feels like a violation of the abstract data type of the
> directory.  I don't think it is one of the features that should be
> preserved.  Especially since the negative effects of the linear list
> of entries was quite severe for a large number of entries.
> Directories using Btrees (the dir_index feature) is a huge step
> forward.

I understand your feeling, but there are surely cases that make use of
it nowadays.  I think that archiving with cp -a is substantially the
same as creating an archive with tar.  The differences between them
are the destination (a directory for the former, a tar archive for the
latter) and the way to check the listing (ls -f (-U) for the former,
tar t for the latter).

> If I were writing that tar archive process and the order of files in
> the tar archive were important to me then I think it would require me
> to create the archive in the specific order I needed.  In the old days
> of linear directory entries I wouldn't count on the directory to
> provide it.  Instead I would likely build the entry list using find
> and then use that to feed to tar to create the archive in the known
> order.  Anything else just feels bad to me since the implementation
> details could have the directory order in other sequences.  As you
> have found to be the actual case.

As noted above, readdir() behaves deterministically not only on ext4
without dir_index but also on btrfs (and xfs).  When I create a tar
archive with a meaningful order, I first arrange the order of the
target directory tree with careful consideration, and then I archive
it.
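
For comparison, the find-then-tar approach you describe might be
sketched as follows.  This assumes GNU find, sort and tar; the
function name archive_sorted is hypothetical, and byte-wise name order
is only one possible choice of "known order".

```shell
#!/bin/sh
# Sketch only: archive_sorted is a hypothetical helper name.

# archive_sorted TREE ARCHIVE
# Build the member list explicitly with find, sort it byte-wise, and
# feed it to tar so the archive order no longer depends on readdir().
archive_sorted() {
    tree=$1
    archive=$2
    find "$tree" -print0 |
        LC_ALL=C sort -z |
        tar --create --file="$archive" --no-recursion --null --files-from=-
}
```

--no-recursion matters here: find has already listed every entry, so
tar must not descend into the directories a second time.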

> And also for a long time now I almost always sort everything.  For
> example hash tables can be fast but will emit entries in an obscure
> order.  Even when repeatable humans don't find it intuitive.  A long
> time ago I learned that sorted ordering just worked better for both
> people and repeatable processes and comparing repeatable processes.

I agree with you.

> > > Anyway, in some cases, the copying directory tree will need to be done
> > > as fast as possible, even ignoring the order of the readdir calls.
> > > However, I don't think changing the specification will be a good idea
> > > because cp has been used in the same manner as before for close to
> > > three decades.
> > > 
> > > So, I propose to add --sort={none,name,inode} option to cp command.
> > 
> > I'm inclined to think an option is not appropriate here,
> > as it doesn't provide any subsequent guarantees from the system.
> 
> FWIW I am in agreement on this thinking too.  Not appropriate.  As an
> abstract data type we don't really want to know how it is implemented.
> It really feels to me that if the order is important that it should be
> maintained explicitly and not accidentally.  And adding another option
> seems like the wrong thing to do here.  It is burrowing into a very
> deep and unique implementation detail when the problem should be
> solved at a different level.  Just my opinion.

Indeed, a cp --sort={none,name,inode} option might feel strange on
file systems whose readdir() is not deterministic.  On the other hand,
cp --sort=name might still be useful on file systems whose readdir()
is deterministic. :-(
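
Until then, something close to the proposed cp --sort=name can
probably be had with a tar pipe.  The sketch below assumes GNU tar
with the --sort option (1.28 or later, if I remember correctly); the
function name copy_sorted is hypothetical.  It only controls the order
in which entries are created, so the resulting readdir() order still
depends on the destination file system.

```shell
#!/bin/sh
# Sketch only: copy_sorted is a hypothetical helper name.

# copy_sorted SRC DST
# Copy the contents of directory SRC into DST, creating the entries in
# name order instead of readdir() order.
copy_sorted() {
    src=$1
    dst=$2
    mkdir -p "$dst" &&
        tar -C "$src" --sort=name -cf - . | tar -C "$dst" -xpf -
}
```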

Regards,
TAMUKI Shoichi


