[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#20768: RFC: Multithreaded grep
bug#20768: RFC: Multithreaded grep
Sun, 7 Jun 2015 23:25:43 -0500
After what looks like 16+ years as an entry in the TODO file (back to
the first commit in the git history), I decided to see if I couldn't
get grep going multithreaded. I now have a working version, so I
figured I'd try to get some feedback on it in its current form and
hopefully ultimately get it included in GNU grep.
At a high level: I've added a new flag, -M/--parallel[=N], that
enables multithreaded operation. The N worker threads (defaulting to
the number of available CPUs) operate in parallel at file granularity,
so it's only of use when operating on multiple files (perhaps via
-R/-r/--directories=recurse). The main thread opens each file and
adds it to a global queue of files to be searched ('workqueue' in
src/grep.c), and the worker threads then pull items off the queue to
For some rough performance numbers, on a 3.5GHz 4-core/8-thread
Haswell running Linux 3.16 with a warm pagecache:
$ time ./src/grep -r FOOB $DIR
$ time ./src/grep -M -r FOOB $DIR
$DIR here is a 15GB tree (24299 directories, 305282 files) of assorted
code -- git repos, tarballs, etc. [Side note: this workload (perhaps
grep in general, not sure) suffered a major performance regression
between v2.20 and v2.21, which I bisected to commit cd36abd4 -- commit
dfff75a4 then recovered some of the lost performance, but it's still
substantially slower than it had been.]
If preferred I can send all the patches directly to the mailing list
(or as one combined diff, whatever's convenient), but for now it's
currently viewable as a series of mostly fairly fine-grained git
commits in the 'multithread' branch at:
(A few of the small cleanup commits might also be of use independently
of the multithreading effort.)
Some notes, caveats and such:
- It doesn't presently build -- all my testing & development has
involved running the final link command manually, because (despite
hours of bashing my head against it) I've been unable to convince
the bootstrap/autotools/gnulib/etc agglomeration to add -lpthread
to the command line. Any advice here would be appreciated.
- There's a slight quirk in command-line interpretation: since the
'-M' short flag takes an optional numeric argument, 'grep -M4' will
be interpreted as '--parallel=4', not '--parallel --context=4'.
This seems safe from a backwards-compatibility standpoint (since no
existing scripts should be using '-M'), but could potentially
produce surprising results if existing scripts are modified in
particular ways (e.g. changing 'grep -R4' to 'grep -RM4'). I'm
open to suggestions on more graceful ways to handle this.
- I'm not really at all familiar with the internal workings of grep's
actual text-search guts, so some of the identifiers I've chosen in
the parts that had to touch those bits (or elsewhere, for that
matter) may not be the best.
- Some of the un-sharing (moving global state into thread-private
structs) is probably a bit heavy-handed; i.e. some things are now
thread-private that could have remained global, but (at least for
now) it was easier to just do it that way.
- Along similar lines, it would probably be nicer to have a way to
just copy the result of `compile()' rather than re-compiling for
- Testing: I've extended a few of the tests in the testsuite to also
run multithreaded, but the coverage isn't exactly extensive.
- I haven't (yet) completed the FSF's copyright-assignment process,
though I have sent the initial email requesting the paperwork.
- Commit messages are in a different style than used in the grep git
repo; I'll fix them up depending on how exactly they should be
- Though I have attempted to follow the GNU coding style, my own
usual coding style differs from it quite a bit, and there may as a
result be some stylistic misfits here and there. I'm happy to fix
any of these anyone points out.
Any/all feedback welcome.
- bug#20768: RFC: Multithreaded grep,
Zev Weiss <=