Re: [Findutils-patches] new predicate
From: Konrad Eisele
Subject: Re: [Findutils-patches] new predicate
Date: Thu, 27 May 2010 23:49:46 +0200
-------- Original Message --------
> Date: Thu, 27 May 2010 15:12:12 -0600
> From: Eric Blake <address@hidden>
> To: Konrad Eisele <address@hidden>
> CC: address@hidden
> Subject: Re: [Findutils-patches] new predicate
> On 05/27/2010 02:04 PM, Konrad Eisele wrote:
> > I wanted to submit a patch that is quite short and
> > meant more as a feature request. It adds the predicate
> > "-dtype <regex>" (dtype meaning datatype). The dtype
> > predicate uses libmagic from the "file" command to get
> > the *content datatype* of the file in view, then matches a regex
> > against it. E.g. "echo abc>f.txt; file f.txt" yields "ASCII text".
> > Therefore "find f.txt -dtype .*text.*" would match the regex ".*text.*"
> > against "ASCII text" (and match).
>
> Personally, I'm a bit reluctant to add this patch, because you can
> achieve the same effect with more efficient use of existing predicates:
>
> >
> > The problem this patch addresses is this:
> > I have several source project directories with several million
> > files in them. I want to make a backup, but I want
> > to back up only text files (Makefiles, shell scripts, c and
> > h files, etc.). Currently I do something like this:
> > (for f in `find <srcdir> -type f`; do if (file $f | cut -d: -f2 | grep text &> /dev/null ); then echo $f; fi; done) > file.list
>
> find <srcdir> -type f -exec sh -c \
> 'file "$@" | sed -n "s/:.*text.*//p"' sh {} + > file.list
Thanks, I wasn't aware of (or able to come up with)
such an expression. It works well for me: my previous
version would run forever, and this one is usable. I guess
that even if my patch is faster and
simpler to type, it would introduce a dependency
on libmagic that might not be worth the effort.
Here are the results of running both on the Linux
source tree:
time /usr/bin/find /usr/src/linux-2.6.29.6/ -type f -exec sh -c \
  'file "$@" | sed -n "s/:.*text.*//p"' sh {} + | xargs file $1
real 3m17.519s
user 5m0.162s
sys 0m6.233s
time /usr/bin/find /usr/src/linux-2.6.29.6/ -dtype .*text.* | xargs file $1
real 1m56.629s
user 3m9.618s
sys 0m3.565s
>
> Remember, the reason your version was so slow is that it was spawning a
> subshell, file, cut, and grep command per file; my version uses exec {}
> + to cram as many files as possible per file(1) invocation, then uses
> sed instead of cut|grep for a further reduction in processes.
>
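To illustrate the point about process counts, here is a minimal sketch that compares the two forms of -exec on a scratch directory. The directory and the file names a, b and c are made up for the example; "echo run" stands in for file(1) so that each printed line corresponds to one spawned command.

```shell
#!/bin/sh
# Scratch directory with three empty files (names are illustrative only).
d=$(mktemp -d)
touch "$d/a" "$d/b" "$d/c"

# "\;" runs the command once per file found, so one line per file is printed.
per_file=$(find "$d" -type f -exec sh -c 'echo run' sh {} \; | wc -l)

# "+" appends as many file names as fit onto one command line, so these
# three short names are handled by a single invocation.
batched=$(find "$d" -type f -exec sh -c 'echo run' sh {} + | wc -l)

echo "per-file invocations: $((per_file))"
echo "batched invocations:  $((batched))"
rm -rf "$d"
```

On a tree with millions of files the batched form still needs more than one invocation, since each command line has a size limit, but the count stays far below one per file.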
> Meanwhile, be aware that this solution assumes that none of the files
> found will contain : or newline; you may want to add some defensive
> programming into your find expression to reject file names matching
> those patterns.
>
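One possible form of that defensive filtering, as a sketch only: the function names safe_files and list_text_files are hypothetical, and the approach assumes it is acceptable to skip any path containing ':' or a newline rather than handle it. It uses -path instead of -name so that directory components are checked too.

```shell
#!/bin/sh
# A literal newline, used to build the "contains a newline" pattern.
nl='
'

# Hypothetical helper: regular files under "$1" whose full path contains
# neither ':' nor a newline, i.e. paths the "file | sed" step can parse.
safe_files() {
    find "$1" -type f ! -path '*:*' ! -path "*${nl}*" -print
}

# Hypothetical helper: the same filter combined with the batched
# file(1) + sed pipeline from the message above.
list_text_files() {
    find "$1" -type f ! -path '*:*' ! -path "*${nl}*" \
        -exec sh -c 'file "$@" | sed -n "s/:.*text.*//p"' sh {} +
}
```

Usage would be something like "list_text_files <srcdir> > file.list". Note that if the starting directory itself contains ':', every path is rejected and the list comes out empty.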
> --
> Eric Blake address@hidden +1-801-349-2682
> Libvirt virtualization library http://libvirt.org
>