bug-findutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] find: doc: Fix -prune SCM example directories and make it mo


From: raf
Subject: Re: [PATCH] find: doc: Fix -prune SCM example directories and make it more efficient
Date: Wed, 23 Nov 2022 19:07:14 +1100

On Tue, Nov 22, 2022 at 11:12:00PM +0100, Bernhard Voelker 
<mail@bernhard-voelker.de> wrote:

> Hi raf,

Hi Berny,

> thanks for the patch.
> Besides busy day job times here, I was actually waiting for the FSF legal
> to add your entry into the 'copyright.list' file on the GNU fencepost server.
> It's not yet in place though, so no need to hurry.

No worries. That'll take some time. I've emailed the
copyright assignment request form to Craig Topham, but
the paperwork has yet to arrive here for signing.

> On 11/18/22 23:48, raf wrote:
> > * find/find.1 - Fix -prune SCM example directories and make it more 
> > efficient
> > * doc/find.texi - Make -prune SCM example more efficient (two ways)
> > * NEWS - Mention the above
> > ---
> >   NEWS          |  3 +++
> >   doc/find.texi | 25 +++++++++++++++++++++----
> >   find/find.1   | 13 ++++++++++++-
> >   3 files changed, 36 insertions(+), 5 deletions(-)
> > 
> > diff --git a/NEWS b/NEWS
> > index 1fff34f8..7c6fcfb3 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -18,6 +18,9 @@ GNU findutils NEWS - User visible changes.      -*- 
> > outline -*- (allout)
> >     When generating the Texinfo manual, `makeinfo` is invoked with the 
> > --no-split
> >     option for all output formats now; this avoids files like 
> > find.info-[12].
> > +  The find.1 manual's -prune SCM example directories/output have been 
> > fixed, and
> > +  the example itself has been made more efficient (find.1 and find.texi) 
> > [#62259]
> > +  Also fixed a typo in find.texi.
> >   * Noteworthy changes in release 4.9.0 (2022-02-22) [stable]
> > diff --git a/doc/find.texi b/doc/find.texi
> > index 379fe646..1a36e243 100644
> > --- a/doc/find.texi
> > +++ b/doc/find.texi
> > @@ -5138,13 +5138,15 @@ already found.
> >   @smallexample
> >   find repo/ \
> > --exec test -d @{@}/.svn \; -or \
> > --exec test -d @{@}/.git \; -or \
> > --exec test -d @{@}/CVS \; -print -prune
> > +-type d \
> > +\( -exec test -d @{@}/.svn \; \
> > +-or -exec test -d @{@}/.git \; \
> > +-or -exec test -d @{@}/CVS \; \
> > +\) -print -prune
> >   @end smallexample
> 
> Indeed, the bug omitting the \( ... \) has been introduced in 2008:
> 
>   $ git diff v4.5.6b-28-gaca33f85^..v4.5.6b-28-gaca33f85 -- doc/find.texi | 
> tail -n12
>   @@ -4683,7 +4683,10 @@ searching subdirectories inside projects whose SCM 
> directory we
>    already found.
> 
>    @smallexample
>   -find repo/ -exec test -d @{@}/.svn -o -d @{@}/.git -o -d @{@}/CVS \; 
> -print -prune
>   +find repo/ \
>   +-exec test -d @{@}/.svn \; -or \
>   +-exec test -d @{@}/.git \; -or \
>   +-exec test -d @{@}/CVS \; -print -prune
>    @end smallexample
> 
>    In this example, @command{test} is used to tell if we are currently
> 
> For find.1, this was already fixed in commit v4.6.0-55-g47d8fd38, but missed
> to check the analog place in the texi file.
> 
> Your patch above aims at optimizing things by letting 'find -type d' 
> pre-filter
> directories in order to avoid -exec invocations on regular and other 
> non-directory
> files.
> The '-type d' test is different from 'test -d ...' because the latter 
> transparently
> follows symbolic links while the former strictly checks on the type of the 
> entry.
> Therefore, the patch changes the result in the case the repo workspace is a 
> symlink:
> the new version would skip it.  Well, adding -L would help.

Ah, I hadn't thought of that. I can add -L.

> >   In this example, @command{test} is used to tell if we are currently
> > -examining a directory which appears to the a project's root directory
> > +examining a directory which appears to be a project's root directory
> 
> good catch!
> 
> >   (because it has an SCM subdirectory).  When we find a project root,
> >   there is no need to search inside it, and @code{-prune} makes sure
> >   that we descend no further.
> > @@ -5153,6 +5155,21 @@ For large, complex trees like the Linux kernel, this 
> > will prevent
> >   searching a large portion of the structure, saving a good deal of
> >   time.
> > +The @samp{-type d} clause causes the three @samp{test} shell
> > +processes to only be executed for directories. This can be made even
> > +more efficient by combining the three @samp{test} shell processes
> > +into a single process:
> > +
> > +@smallexample
> > +find repo/ \
> > +-type d \
> > +-exec sh -c 'test -d "$1"/.svn || test -d "$1"/.git || test -d "$1"/CVS' . 
> > {} \; \
> > +-print -prune
> > +@end smallexample
> > +
> > +Note that the @samp{.} argument is just a placeholder for the unused
> > +@samp{$0} environment variable in the @samp{sh -c} command. The
> > +@samp{@{@}} argument is the @samp{$1} environment variable.
> _______________________________^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> s/environment/shell/, and maybe s/variable/parameter/

Are you sure? The term "environment variable" is a
fairly standard term. The find documentation (manual
and info) already uses it at least 16 times. There are
no appearances there of the term "shell parameter".

Should I change it anyway?

> >   @node Security Considerations
> >   @chapter Security Considerations
> 
> I'm still unsure about efficiency here.  Summary:
> 
> 1) Originally it was:
> 
>   $ find repo/ -exec test -d '{}/.svn' -o -d '{}/.git' -o -d '{}/CVS' \; 
> -print -prune
> 
> -> 1 process, doing the OR-ing itself.  Nice and neat, and efficient, but 
> maybe a bit
> confusing for the novice reader: it may not be obvious that test(1) is doing 
> the OR-ing,
> because -o is also a find(1) operator.
> Besides readability, the test -o operator is discouraged.
> 
>   $ env '[' --help
>   ...
>   NOTE: Binary -a and -o are inherently ambiguous.  Use 'test EXPR1 && test
>   EXPR2' or 'test EXPR1 || test EXPR2' instead.
>   ...
> 
> 2) Now there's the 3x -exec, with fixed -or grouping (with or without '-type 
> d'):
> 
>   $ find repo/ -type d \
>       '(' -exec test -d '{}/.svn' \; -or \
>           -exec test -d '{}/.git' \; -or \
>           -exec test -d '{}/CVS' \; \
>       ')' -print -prune
> 
> -> less efficient than 1) because it launches 3x test(1) via -exec.
> OTOH good to read.
> 
> 3. Then there's 1x -exec to launch a shell process to do the OR-ing:
> 
>   $ find repo/ \
>       -type d \
>       -exec sh -c 'test -d "$1"/.svn || test -d "$1"/.git || test -d 
> "$1"/CVS' . {} \; \
>       -print -prune
> 
> While there's only 1x -exec per directory, I'm wondering if launching one
> shell process with 3x test (which is usually the shell builtin) is really
> more efficient than 3x launching the executable found by `which test`.
> 
> We don't have a measurement, and the results may vary on different systems,
> but I could imagine the launching a shell is more expensive than launching
> 3x /usr/bin/test.
>
> Anyhow, I like the discussion, and I believe it's good to have several 
> alternatives
> documented with their individual pros vs. cons considerations.
> WDYT?

I just did a quick check (Debian vm on a mature macbookpro)
and 1000 * 3 * /usr/bin/test is 3.8s and 1000 * /bin/sh is
1.3s (~3x faster). I'm not surprised. dash is only 2x the
size of test (both small), and they load the same shared
libraries.

The original version of this patch just had the single
-exec version (my preference), but it was thought better to
leave the three -exec version in place, and add the single
-exec version as an alternative.

I think either is OK. The more important difference is
adding -test d and not executing any sh/test processes
for every non-directory.

> > diff --git a/find/find.1 b/find/find.1
> > index 429aa2f0..4faaa23b 100644
> > --- a/find/find.1
> > +++ b/find/find.1
> > @@ -2494,6 +2494,7 @@ projects' roots:
> >   .in +4m
> >   .B $ find repo/ \e
> >   .in +4m
> > +.B \-type d \e
> >   .B \e( \-exec test \-d \(aq{}/.svn\(aq \e; \e
> >   .B \-or \-exec test \-d \(aq{}/.git\(aq \e; \e
> >   .B \-or \-exec test \-d \(aq{}/CVS\(aq \e; \e
> > @@ -2502,7 +2503,7 @@ projects' roots:
> >   .in -4m
> >   \&
> >   .fi
> > -Sample output:
> > +Sample directories:
> >   .nf
> >   \&
> >   .in +4m
> > @@ -2513,6 +2514,16 @@ Sample output:
> >   .B repo/project4/.git
> >   .in
> >   \&
> > +Sample output:
> > +.nf
> > +\&
> > +.in +4m
> > +.B repo/project1
> > +.B repo/gnu/project2
> > +.B repo/gnu/project3
> > +.B repo/project4
> > +.in
> > +\&
> >   .fi
> >   In this example,
> >   .B \-prune
> 
> +1 on this.
> 
> Have a nice day,
> Berny

Thanks. You too.

cheers,
raf




reply via email to

[Prev in Thread] Current Thread [Next in Thread]