Re: find and glob

From: Mark Hills
Subject: Re: find and glob
Date: Thu, 29 Mar 2012 18:00:34 +0100 (BST)

Hi James, thanks for your suggestions. I've addressed the points inline.

Perhaps I wasn't clear: this was an example case, and I was hoping for a 
general solution in the find command itself (and was surprised not to see one).

On Thu, 29 Mar 2012, James Youngman wrote:

> On Tue, Mar 27, 2012 at 3:24 PM, Mark Hills <address@hidden> wrote:
> > We traverse portions of our filesystem and apply a find action to them;
> > currently by allowing the shell to expand the glob; eg.
> >
> >  find ./*/xx/*/yy
> >
> > But the expansion can be large and problematic before being passed to
> > find.
> I'm not sure what you mean by "problematic" here.   It's possible I
> suppose that the shell runs out of RAM in which to expand the glob, or
> the results exceed ARG_MAX.    I'll assume it's the latter for the
> purpose of this reply, please correct me if this is the wrong
> interpretation of what you meant.

Yes, that's correct.
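For reference, the limit can be compared against the size of the expansion without exec'ing anything on the expanded list. This is only a minimal sketch: it assumes a POSIX shell in which printf is a builtin (so the expansion itself is not subject to ARG_MAX), and the temporary tree with the names a, b, c, d and other is made up for illustration.

```shell
# Build a small hypothetical tree so the sketch is self-contained.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/xx/b/yy" "$tmp/c/xx/d/yy" "$tmp/a/other"
cd "$tmp" || exit 1

# Limit on the combined size of arguments and environment passed to exec().
arg_max=$(getconf ARG_MAX)

# Bytes the expanded glob would occupy as NUL-terminated strings.
# printf is a shell builtin here, so the expansion is never passed to exec().
# This is an approximation: the kernel also counts pointers and the environment.
glob_bytes=$(printf '%s\0' ./*/xx/*/yy | wc -c | tr -d ' ')

echo "ARG_MAX=$arg_max, glob expansion=$glob_bytes bytes"
```

If glob_bytes approaches arg_max, passing the expansion to a single find invocation is likely to fail with "Argument list too long".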
> > To do the equivalent in find itself is slow.
> How slow?   How much slower?
> > The whole hierarchy is traversed (which is slow), and only matching results 
> > displayed:
> >
> >  find . -path './*/xx/*/yy'
> You don't state what the structure of your filesystem hierarchy is, so
> it is hard to give entirely reliable advice here.   I'm going to guess
> a bit about things like the depth of the tree (which I'm going to
> guess is large), the total number of files below "." (also large) and
> the cardinalities of the expansions of "*" in the glob above (also
> large).

Yes, that's correct.

Also, there are several directories alongside the 'xx' and 'yy' 
directories, themselves also large. It would be wasteful to traverse 
these; they cannot and do not match.

> If that is your whole command line you are certainly using find in an
> inefficient way.   It's hard to say for sure since you don't state
> what fraction of the whole filesystem hierarchy you need to visit, or
> what the actions are.    However, the predicates -mindepth, -maxdepth,
> -prune and -quit can be used to limit or terminate the filesystem
> search.

The command is typically followed by tests against the (non-path) 
attributes (e.g. -mtime), and then by -print or -exec.

> > Is there a way to have find itself only visit the relevant portions of the
> > filesystem?
> Certainly.  If I knew quite what you meant by "relevant" I could
> provide a more useful response.   Instead I will provide some
> examples.

To clarify, by "relevant" I meant those which match or could potentially 
match the pattern. At the moment it scans the whole hierarchy.
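As an illustration of "relevant", the glob can be hand-translated into tests that prune every directory which can no longer match. This is a rough sketch, not a general solution: it assumes GNU find (the -regex test with its default Emacs-style syntax), the sample tree is made up, and unlike the shell glob it would also match names beginning with a dot.

```shell
# Hypothetical tree: two matching yy directories plus siblings that cannot match.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/xx/b/yy/deep" "$tmp/a/xx/b/zz/skipped" \
         "$tmp/a/big/skipped" "$tmp/c/xx/d/yy"
touch "$tmp/a/xx/b/yy/deep/file" "$tmp/a/big/skipped/file"
cd "$tmp" || exit 1

# Depth 2: prune anything not named xx.
# Depth 4: prune anything not named yy.
# Otherwise: print ./*/xx/*/yy and everything below it.
matches=$(
  find . \
    \( -regex '\./[^/]*/[^/]*' ! -name xx -prune \) -o \
    \( -regex '\./[^/]*/[^/]*/[^/]*/[^/]*' ! -name yy -prune \) -o \
    -regex '\./[^/]*/xx/[^/]*/yy\(/.*\)?' -print | LC_ALL=C sort
)
printf '%s\n' "$matches"
```

In real use, -print would be replaced by the attribute tests and actions (-mtime, -exec and so on). The pruned directories (a/big and a/xx/b/zz in this tree) are never descended into.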

> We start with your original command, which you state as problematic:
> $  find ./*/xx/*/yy
> I'm going to assume you really meant you use
> $  find ./*/xx/*/yy -actions
> where -actions is some non-empty mixture of find predicates and
> actions.  If -actions already includes -mindepth, -maxdepth, -prune or
> (most awkwardly) -quit, some of the examples below are going to need
> adjustment.
> The simplest rearrangement is
> $ for start in ./*/xx; do
>   find "${start}"/*/yy -actions
> done
> This will dramatically cut down the number of arguments passed to each
> invocation of find, and so may be enough by itself to form a
> satisfactory solution to your problem.   If the argument count is
> still too large you could also try:
> $ for start in ./*/xx; do
>   for sub in "${start}"/*/yy; do
>     find "${sub}" -actions
>   done
> done
> If you still have a problem with this second option, it's likely that
> one of the "*"s expands to a sufficiently large list that ARG_MAX is
> still exceeded.   You can overcome this by transforming the loop into
> find predicates.   I'll do this with only the inner loop for
> simplicity:
>   for sub in "${start}"/*/yy; do
>     find "${sub}" -actions
>   done
> becomes
> find "${start}" -mindepth 2 \( -depth 2 \! -name yy -prune , -true \) -actions
> If -actions contains tests like -depth, options like -mindepth or
> -maxdepth, then some adjustment will be needed there.

Thanks for the examples -- you have understood my explanation correctly.

I assume that the examples confirm that this kind of selective traversal 
cannot be done in find itself?

My similar solution (which interprets the wildcard automatically) was to 
wrap the glob(3) function in a command which prints the matches to stdout, 
and use that; e.g.

  glob './*/xx/*/yy' | xargs -n 100 -I'!!' -- find '!!' -actions

But this, like the examples, is rather unwieldy!
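A shell-only variant of the same idea, as a sketch: let the shell expand the glob into a builtin printf (nothing is exec'd, so ARG_MAX does not apply to the expansion), and let xargs batch the paths into find. This assumes printf is a builtin and that xargs supports -0 (GNU and BSD xargs do). The tests -type f -mtime -1 -print merely stand in for -actions, and the sample tree is made up.

```shell
# Hypothetical tree with one recently modified file under each yy directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/xx/b/yy" "$tmp/c/xx/d/yy" "$tmp/a/other"
touch "$tmp/a/xx/b/yy/new" "$tmp/c/xx/d/yy/new" "$tmp/a/other/new"
cd "$tmp" || exit 1

# The shell expands the glob; printf emits NUL-terminated paths;
# xargs hands them to find in batches of at most 100 start points.
# The trailing "sh" becomes $0 of the inner shell, so "$@" holds only the paths.
found=$(
  printf '%s\0' ./*/xx/*/yy |
    xargs -0 -n 100 sh -c 'find "$@" -type f -mtime -1 -print' sh |
    LC_ALL=C sort
)
printf '%s\n' "$found"
```

Note that if the glob matches nothing, most shells pass the pattern through literally, so find would be given a non-existent path.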

> > The manual [1] seems to suggest using locate and xargs. Keeping an index
> > is not practical for us,
> I assume because the tree changes frequently and multiple
> independent locate indexes would be no help (since all parts of the
> tree change frequently).

The tree changes frequently, and is too large (millions of files) to 
feasibly index within usable time.

> > so I wrote a simple command around the glob(3)
> > function to do the traversal and print to stdout. Am I missing some well
> > established method here?
> It's difficult to give a definitive answer here since you don't state
> what you're actually trying to achieve.   I hope the above was useful
> anyway.

Useful suggestion, thank you.

I'm trying to achieve the functionality of glob match within the find 


