coreutils

Re: cp: behavior regression in 8.23


From: Bob Proulx
Subject: Re: cp: behavior regression in 8.23
Date: Sat, 31 Jan 2015 20:00:44 -0700
User-agent: Mutt/1.5.23 (2014-03-12)

Pádraig Brady wrote:
> TAMUKI Shoichi wrote:
> >> What's the particular problem you have with the order
> >> of the files in the tar archive, so I understand your issue completely?
> > 
> > The point is cp should keep the function to preserve the deterministic
> > directory structure of the original files/directories in the copy.
> > That was possible with the cp in coreutils-8.22 or earlier.
> 
> There is no such guarantee from the system though.
> Depending on the file system, number of files and the structure
> of the underlying tree etc. the order can change.
> cp uses savedir() (similar to scandir), which calls readdir(),
> and readdir() is not deterministic.

At one time directories were purely files, special files but files
just the same.  Entries were stored as a list.  Lookup was linear.  In
those old days an accidental order appeared by the order files were
listed as entries in the file.  The ordering depended upon the history
of files created and deleted.

The list structure had some severe problems.  Any directory with a
large number of files in it (where "large" can be as few as 5,000)
became very inefficient to work with because lookups in the list are
linear.  Directories would really slow down.  For many years now newer
file systems have used B-trees for the on-disk directory structure.
The old linear lists are mostly a legacy that shouldn't be seen
anymore in real usage.
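
To make the "accidental order" concrete, here is a toy model in Python.
It is only a sketch under my own assumptions: the class and its slot
reuse policy are hypothetical and do not describe any real file
system's on-disk format.  It shows how a listing ends up reflecting
the history of creations and deletions, and why lookup is linear.

```python
# Toy model of a linear-list directory (hypothetical, for illustration;
# not the on-disk format of any real file system).

class LinearDirectory:
    def __init__(self):
        self.slots = []  # each slot holds a name, or None if it is free

    def create(self, name):
        # Assumed policy: reuse the first free slot, otherwise append.
        for i, entry in enumerate(self.slots):
            if entry is None:
                self.slots[i] = name
                return
        self.slots.append(name)

    def unlink(self, name):
        # Linear search: the cost grows with the number of entries.
        i = self.slots.index(name)
        self.slots[i] = None

    def readdir(self):
        # Entries come back in slot order, an accident of history.
        return [entry for entry in self.slots if entry is not None]


d = LinearDirectory()
for name in ["c", "a", "b"]:
    d.create(name)
d.unlink("a")
d.create("z")          # lands in the slot that "a" used to occupy
listing = d.readdir()  # ['c', 'z', 'b']
```

Nothing about the names themselves determines that result; a different
sequence of creates and deletes gives a different listing.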

> > This issue is reproducible on ext[234] w/o dir_index feature, btrfs,
> > and xfs filesystems.

And without the dir_index feature you are using the old directory list
behavior instead of the tree/index.  That is why you are seeing it.
If you did use dir_index then the internal order would be determined
by the index (on ext3/ext4, a hash of each file name) rather than by
the history of creations and deletions.

> I still don't understand why this is an issue TBH.
> Directory listing programs like ls normally sort results.
> If you want reproducible builds then tar has the --sort=name option.

Usually I am arguing the side of the status quo and supporting legacy
behavior.  But I don't think this is one of those features that should
be preserved.  That directories were previously implemented as a list
of entries feels to me like an internal implementation detail that
should not be known on the outside.  Knowing that and making use of
that detail feels like a violation of the abstract data type of the
directory.  Especially since the negative effects of the linear list
of entries were quite severe for a large number of entries.
Directories using B-trees (the dir_index feature) are a huge step
forward.

If I were writing that tar archive process and the order of files in
the tar archive were important to me then I think it would require me
to create the archive in the specific order I needed.  In the old days
of linear directory entries I wouldn't count on the directory to
provide it.  Instead I would likely build the entry list using find
and then use that to feed to tar to create the archive in the known
order.  Anything else just feels bad to me since the implementation
details could produce the directory entries in some other sequence.
As you have found to be the actual case.
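
As a rough illustration of building the list explicitly and feeding it
to the archiver, here is the same idea sketched in Python with the
standard tarfile module instead of find and tar.  The function names
sorted_members and make_archive are made up for this example.  The
sketch only pins down the order of the members; timestamps, ownership
and the like still go into the archive as they are on disk.

```python
import os
import tarfile


def sorted_members(root):
    """Yield every path below root in a fixed, name-sorted order."""
    with os.scandir(root) as it:
        # Sort explicitly; never rely on the order readdir() happens to give.
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        yield entry.path
        # Descend into real directories only, not symlinks to directories.
        if entry.is_dir(follow_symlinks=False):
            yield from sorted_members(entry.path)


def make_archive(root, archive_path):
    """Write a tar archive whose member order is decided by us."""
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted_members(root):
            # recursive=False: add just this entry, so tarfile does not
            # walk the directory itself in whatever order it finds.
            tar.add(path, arcname=os.path.relpath(path, root),
                    recursive=False)
```

Calling make_archive("tree", "out.tar") on a hypothetical directory
"tree" lists its members in the same order no matter how the file
system happens to store the directory entries.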

And also for a long time now I almost always sort everything.  For
example hash tables can be fast but will emit entries in an obscure
order.  Even when that order is repeatable, humans don't find it
intuitive.  A long time ago I learned that sorted ordering just works
better for people, for repeatable processes, and for comparing the
output of repeatable processes.
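
A small Python sketch of what sorting buys when comparing two runs.
The helper names sorted_listing and same_names are invented for the
example, and it compares path names only, not file contents.

```python
import os


def sorted_listing(root):
    """Return every path under root, relative to root, sorted by name."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full, root))
    # The sort removes any dependence on the order the walk produced.
    return sorted(paths)


def same_names(tree_a, tree_b):
    """True if both trees contain the same set of relative paths."""
    return sorted_listing(tree_a) == sorted_listing(tree_b)
```

Two trees populated in different orders compare equal this way, which
is what a repeatable process needs.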

> > Anyway, in some cases, the copying directory tree will need to be done
> > as fast as possible, even ignoring the order of the readdir calls.
> > However, I don't think changing the specification will be a good idea
> > because cp has been used in the same manner as before for close to
> > three decades.
> > 
> > So, I propose to add --sort={none,name,inode} option to cp command.
> 
> I'm inclined to think an option is not appropriate here,
> as it doesn't provide any subsequent guarantees from the system.

FWIW I am in agreement on this thinking too.  Not appropriate.  With an
abstract data type we don't really want to know how it is implemented.
It really feels to me that if the order is important then it should be
maintained explicitly and not accidentally.  And adding another option
seems like the wrong thing to do here.  It is burrowing into a very
deep and unique implementation detail when the problem should be
solved at a different level.  Just my opinion.

> It's such a widely used hack that I think file systems
> would continue to have inode order somewhat related to
> locality on disk (which is still important for SSDs).

Not usually.  An SSD presents a logical block structure to the outside
world.  But internally the SSD firmware will be routinely
repositioning the physical location of the data within the flash
cells.  Not only is there no way to know from the outside where the
data is stored, it will also be routinely moved around by the SSD's
internal firmware.  Basically with an SSD everything that we ever
thought we knew about classic disk drive storage algorithms is
gone.  In its place we have a true abstract interface to a database on
the SSD where the actual storage is mapped and re-mapped all of the
time.

Bob


