bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#44754: Extreme performance degradation in GNU grep 3.4 / 3.6


From: Jim Meyering
Subject: bug#44754: Extreme performance degradation in GNU grep 3.4 / 3.6
Date: Wed, 25 Nov 2020 15:12:04 -1000

On Thu, Nov 19, 2020 at 7:32 PM Frank Heckenbach
<f.heckenbach@fh-soft.de> wrote:
> I have a use case where I run grep with a large number of search
> patterns on a large text file. It works well with grep-3.3, but with
> grep-3.4 it quickly burned through GBs of memory and almost locked
> up my system due to swapping.
>
> To avoid attaching those large files, I could mostly reproduce the
> effects like this:
>
> ulimit -d 5000000  # avoid system lockup due to excessive swapping
> export LC_ALL=C    # make sure no Unicode case conversions are needed
>
> % time  ./grep-3.3 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
>
> real    0m0.054s
> user    0m0.048s
> sys     0m0.012s
>
> % time  ./grep-3.4 -Fwif <(seq 30000 | tr 0-9 A-J) <<<foo
> ./grep-3.4: Memory exhausted
> Aborted
>
> real    0m1.291s
> user    0m0.696s
> sys     0m0.599s
>
> % time  ./grep-3.6 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo
>
> real    0m13.162s
> user    0m12.955s
> sys     0m0.211s
>
> grep-3.3 behaves well, even with much larger number of patterns.
> Time seems to grow linearly, and memory usage is constant.
>
> grep-3.4 behaves the worst of these 3 versions. Even with just 30000
> patterns it exceeds the ulimit of 5 GB.
>
> grep-3.6 behaves a bit better than 3.4, but still bad. Time seems to
> be quadratic in the number of patterns, and though memory usage in
> this case seems to be almost constant, in my actual use case it also
> runs out of memory where grep-3.3 works well with just a few 100 MB
> used.
>
> Without "-i", grep-3.4 seems to run as fast as grep-3.3, but
> grep-3.6 is almost as slow as with "-i".
>
> So there might actually be two different issues here, one that
> affects 3.4 with "-i" and one that affects 3.6 with or without "-i".

Thank you for the fine bug report.
The grep-3.6 bug you've exposed is due to the fact that your input
triggers excessive hash collisions when using the code modeled after
gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase
take O(N^2) time for N patterns. In the attached, I've switched grep
to use the djb2 hash function, and that resolves the problem. I'll
also add a NEWS entry and a test before pushing this.

Attachment: 0001-grep-avoid-performance-regression-with-many-patterns.patch
Description: Binary data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]