
bug#11950: cp: Recursively copy ordered for maximal reading speed


From: Alan Curry
Subject: bug#11950: cp: Recursively copy ordered for maximal reading speed
Date: Mon, 16 Jul 2012 15:52:42 -0500 (GMT+5)

Michael writes:
> 
> Hello,
> 
> After coding several backup tools, there's something that has been on my
> mind for years. When 'cp' copies files recursively from magnetic hard disks
> (commonly named after their adapter or bus: SATA, IDE, and the like; I'm
> not talking about solid state), it seems to pick up the files in 'raw'
> order, just as the disk buffer spits them out (as if 'in one head move').
> It does not resemble any alphabetical order, and it does not even stay
> within the same parent folder (jumping back and forth as the files come in).

[grumble at User-Agent: claws-mail.org: One line per paragraph isn't good
mail formatting!]

It's called directory order. It used to be simply order of creation of
files, with deletions creating gaps that could be filled by later
creations with same-length or shorter names.

But on most new filesystems, directories are stored in a non-linear
structure so that lookups in a large directory don't have to scan
through every name. For ext2/ext3/ext4, run tune2fs -l on the block
device and look for dir_index in the "Filesystem features" line.
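
A minimal sketch of that check, assuming the usual tune2fs -l layout in
which the features appear on one "Filesystem features:" line. The helper
name has_dir_index and the device /dev/sdXN are placeholders invented for
this example:

```shell
# Hypothetical helper: reads `tune2fs -l` output on stdin and succeeds
# only if dir_index appears on the "Filesystem features:" line.
has_dir_index() {
    grep '^Filesystem features:' | grep -qw 'dir_index'
}

# Usage (as root; replace /dev/sdXN with your real block device):
#   tune2fs -l /dev/sdXN | has_dir_index && echo "dir_index is enabled"
```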

If you're copying files onto a filesystem with dir_index enabled, the
order in which cp creates them should have little effect on the
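directory's layout afterward. If you're not using dir_index on the
destination filesystem, there's your problem! Enable dir_index (tune2fs
-O dir_index, then e2fsck -fD on the unmounted filesystem to index the
directories that already exist) and directory lookups will be fast.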

None of this has anything to do with where the actual data blocks of the
file will be allocated. There's no way to control that. If you think
that the second file created is going to be adjacent to the first file
created... that's never been guaranteed. Filesystem block allocators are
way more mysterious than that.

If you really think there's something to be gained here, prove it: start
with a directory with a lot of files but no subdirectories. Do an
alphabetical-order copy like this:

$ mkdir other_directory ; cp ./* other_directory

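(The glob returns the names in sorted order, so this gives you the
creation order you want, unlike cp -r. Since other_directory also matches
the glob, cp will complain that it is skipping a directory; ignore that.)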

Then get it all out of cache so the read test will hit the disk as much
as possible (writing to drop_caches requires root):

# sync ; echo 3 > /proc/sys/vm/drop_caches

And read back the files:

$ cd other_directory ; time cat ./* > /dev/null

Now repeat, but using cp -r to create the other directory so the files
get copied in the source directory order. And repeat again, but using

$ find . -type f -exec cat '{}' + > /dev/null

instead of the cat ./* (the glob will cat the files in sorted order, the
find will use directory order).
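
To see the two orders side by side before timing anything, something
like the following should do. It is only a sketch: the function names
are made up here, and -U (do not sort) is the GNU ls option.

```shell
# Names in the order the shell glob produces them (sorted).
glob_order() {
    ( cd "$1" && printf '%s\n' * )
}

# Names in raw directory order; GNU ls -U disables sorting.
dir_order() {
    ( cd "$1" && ls -U )
}

# Usage:
#   glob_order other_directory > /tmp/sorted.txt
#   dir_order  other_directory > /tmp/raw.txt
#   diff /tmp/sorted.txt /tmp/raw.txt
```

If diff prints nothing, the directory happens to be in sorted order
already and the timing comparison won't tell you much.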

If there are any significant differences in the times, and dir_index is
enabled, you're onto something. With dir_index disabled, you should get
worse times all around, but not a lot worse if the files are big enough
that the time spent reading their contents overshadows the time spent on
directory lookups.
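
The read half of the experiment could be wrapped up roughly like this.
It is a sketch, not a finished benchmark: the function names are
invented for this example, and the cache drop is skipped with a warning
when not running as root.

```shell
# Flush dirty pages and drop the page cache. Needs root; otherwise warn.
drop_caches() {
    sync
    if [ -w /proc/sys/vm/drop_caches ]; then
        echo 3 > /proc/sys/vm/drop_caches
    else
        echo "cannot drop caches (not root?); timings will be misleading" >&2
    fi
}

# Read every file in directory $1 in sorted (glob) order.
read_sorted() {
    ( cd "$1" && cat ./* )
}

# Read every regular file under directory $1 in directory order.
read_dir_order() {
    ( cd "$1" && find . -type f -exec cat '{}' + )
}

# Usage (bash, where `time` is a keyword and can time a function):
#   drop_caches; time read_sorted    other_directory > /dev/null
#   drop_caches; time read_dir_order other_directory > /dev/null
```

Run each pair once against the directory made with cp ./* and once
against the one made with cp -r, and compare the four times.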

-- 
Alan Curry




