bug-findutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH 3/3] find: take advantage of new gnulib/fts leaf-optimization


From: Jim Meyering
Subject: [PATCH 3/3] find: take advantage of new gnulib/fts leaf-optimization
Date: Sat, 27 Dec 2008 12:42:00 +0100

Hi,

[the fts patch below is only for reference; it would
 make find misbehave on some non-unix-like file systems]

This started when I noticed find (with no options at all) taking far
too long on a cold file system when run in a directory with just a few
thousand entries.  strace showed that it was calling stat (fstatat64)
on each and every file.  Sure, it happened to be a reiserfs file system
(which lacks dirent.dtype info), but still, we've known about the classic
Unix dir-traversal leaf-optimization technique for ages.  The following
lets ftsfind take advantage of the newly-exposed capability in gnulib's
fts.

For this to be usable, you'll need an ABI-changing patch (below)
for gnulib's fts.  Take this with a grain of salt: I've tested it
only lightly, although I have verified that it works as advertised
(strace+valgrind) on a reiserfs file system, and both findutils and
coreutils still pass "make check" with it.

Making fts enable the optimization by default is not trivial,
because "common wisdom" (or at least find's documentation ;-)
says it may inhibit traversal into msdos or CDROM-based file
systems and AFS mount points are "different".

While writing this, I realized that fts (and find) can perform the
optimization reliably and efficiently, *without* the need for any option
like -noleaf.  I.e., if there's no dirent.dtype info, look up the parent
directory's fs type (call fstatfs only once per distinct device number)
and if it's a type for which the link/dir-counting optimization is valid,
apply the optimization.

If find were to apply this same technique, it would make the
-noleaf predicate a no-op.

FYI, when I run the modified find on a reiserfs-backed 1.6M-file maildir
hierarchy, it takes only 80 seconds (2.6.26, athlon64 3400+, 2yr-old disk),
while using the latest unmodified find, at over 16 minutes, it takes 12
times as long.

I'll post again when I have an fts patch implementing
the approach outlined above.


>From 5ca0d569635c801910c813a9f5c027759bafdf99 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Fri, 26 Dec 2008 18:28:10 +0100
Subject: [PATCH] find: take advantage of new gnulib/fts leaf-optimization

(find): Propagate --noleaf to fts_open via new FTS_NO_LEAF.
* find/ftsfind.c (consider_visiting): Allow state.type to be 0
when fts_info is FTS_NSOK, and set mode accordingly.

This allows find to process an fts entry for which fts_read
returned FTS_NSOK (no stat)
but for which find has no type info.  This happens on any
file system that lacks dirent.dtype information, like reiserfs.
With fts.c from current gnulib, this change is not necessary,
because fts always stats non-dir entries when there is no
dirent.dtype information.

However, combined with the upcoming gnulib change, this lets find
avoid per-non-directory stat-like syscalls (i.e. fstatat)
in some very common cases, like "find . -print" --
which, as you know, can be a huge performance savings.

---
 find/ftsfind.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/find/ftsfind.c b/find/ftsfind.c
index b3d44f8..7a650a2 100644
--- a/find/ftsfind.c
+++ b/find/ftsfind.c
@@ -472,8 +472,8 @@ consider_visiting(FTS *p, FTSENT *ent)
       || ent->fts_info == FTS_NS /* e.g. symlink loop */)
     {
       assert (!state.have_stat);
-      assert (state.type != 0);
-      mode = state.type;
+      assert (ent->fts_info == FTS_NSOK || state.type != 0);
+      mode = (ent->fts_info == FTS_NSOK ? 0 : state.type);
     }
   else
     {
@@ -603,7 +603,10 @@ find(char *arg)

   if (options.stay_on_filesystem)
     ftsoptions |= FTS_XDEV;
-      
+
+  if (options.no_leaf_check)
+    ftsoptions |= FTS_NO_LEAF;
+
   p = fts_open(arglist, ftsoptions, NULL);
   if (NULL == p)
     {
--
1.6.1.302.gccd4d


>From f68e5e1c73f736e2a53889d044c812207ee2e343 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Fri, 26 Dec 2008 19:40:16 +0100
Subject: [PATCH] fts: arrange not to stat non-directories in more cases

* lib/fts_.h (struct [ftsent] (fts_n_dirs_remaining): New member.
* lib/fts.c (fts_stat): Initialize new member (if dir).
(fts_read): Decrement parent's fts_n_dirs_remaining count if
we've just stat'ed a directory.  Skip the stat call when possible.
---
 lib/fts.c  |   22 +++++++++++++++++++++-
 lib/fts_.h |    1 +
 2 files changed, 22 insertions(+), 1 deletions(-)

diff --git a/lib/fts.c b/lib/fts.c
index 836c179..7f57f06 100644
--- a/lib/fts.c
+++ b/lib/fts.c
@@ -791,7 +791,25 @@ check_for_dir:
                if (p->fts_info == FTS_NSOK)
                  {
                    if (p->fts_statp->st_size == FTS_STAT_REQUIRED)
-                     p->fts_info = fts_stat(sp, p, false);
+                     {
+                       FTSENT *parent = p->fts_parent;
+                       if (parent
+                           && parent->fts_n_dirs_remaining == 0
+                           && ISSET(FTS_NOSTAT)
+                           && ISSET(FTS_PHYSICAL))
+                         {
+                           /* nothing more needed */
+                         }
+                       else
+                         {
+                           p->fts_info = fts_stat(sp, p, false);
+                           if (S_ISDIR(p->fts_statp->st_mode))
+                             {
+                               if (parent->fts_n_dirs_remaining)
+                                 parent->fts_n_dirs_remaining--;
+                             }
+                         }
+                     }
                    else
                      fts_assert (p->fts_statp->st_size == 
FTS_NO_STAT_REQUIRED);
                  }
@@ -1560,6 +1578,8 @@ err:              memset(sbp, 0, sizeof(struct stat));
        }

        if (S_ISDIR(sbp->st_mode)) {
+               p->fts_n_dirs_remaining = (sbp->st_nlink
+                                          - (ISSET(FTS_SEEDOT) ? 0 : 2));
                if (ISDOT(p->fts_name)) {
                        /* Command-line "." and ".." are real directories. */
                        return (p->fts_level == FTS_ROOTLEVEL ? FTS_D : 
FTS_DOT);
diff --git a/lib/fts_.h b/lib/fts_.h
index b9554f2..d55075e 100644
--- a/lib/fts_.h
+++ b/lib/fts_.h
@@ -192,6 +192,7 @@ typedef struct _ftsent {
        ptrdiff_t fts_level;            /* depth (-1 to N) */

        size_t fts_namelen;             /* strlen(fts_name) */
+       nlink_t fts_n_dirs_remaining;   /* count down from st_nlink */

 # define FTS_D          1              /* preorder directory */
 # define FTS_DC                 2              /* directory that causes cycles 
*/
--
1.6.1.302.gccd4d




reply via email to

[Prev in Thread] Current Thread [Next in Thread]