bug#22357: grep -f not only huge memory usage, but also huge time cost
From: Trevor Cordes
Subject: bug#22357: grep -f not only huge memory usage, but also huge time cost
Date: Fri, 9 Dec 2016 01:24:19 -0600
User-agent: Mutt/1.7.1 (2016-10-04)
I bisected this bug to commits:
662b19f2d0edc0bf07f1a2c1421080252df4a37c
468d5217ed6ec1679512dec208c7f30fb8612957
(I can't narrow it down further because the latter doesn't compile for me.)
This bug has hit me hard. I have a script that wants to do:
grep -w -f /usr/share/dict/words /tmp/greptest
(good older version: 2 seconds to complete, minimal memory)
(any version after the above commits: 10 or more minutes at 100% CPU with
1.2GB of RAM usage; I never waited for it to finish)
Even if /tmp/greptest is empty or has only 1 word in it, this command does
not finish within the 10 minutes at 100% CPU that I was willing to wait. It
takes 1.2GB of RAM.
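For anyone who wants to reproduce this without tying up a terminal, here is a rough sketch of how the run can be capped with a time limit. The function name run_with_limit and the 60-second default are my own placeholders, and it assumes timeout(1) from GNU coreutils is installed (it exits with status 124 when the limit is hit).

```shell
#!/bin/sh
# Hypothetical helper (not part of grep): run "grep -w -f PATTERNS INPUT"
# under a time limit and print "finished" or "timed out".
run_with_limit() {
    patterns=$1
    input=$2
    limit=${3:-60}   # seconds; arbitrary default chosen for illustration

    # timeout(1) exits with 124 if it had to kill the command.
    timeout "$limit" grep -w -f "$patterns" "$input" >/dev/null
    status=$?

    if [ "$status" -eq 124 ]; then
        echo "timed out"
    else
        # grep exits 0 (match) or 1 (no match) on a normal run.
        echo "finished"
    fi
}
```

With the files from this report it would be called as: run_with_limit /usr/share/dict/words /tmp/greptest 600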
If I use the grep version before those commits my sample command above
runs in 2s! 2s!!! And it doesn't use up 1.25GB. The words file is only
5MB!
It's clear that the commit (which is very simple/tiny) is switching grep
into a different mode than it used to and this mode is horribly awful with
large -f input files. I tested the latest HEAD as of yesterday, and the bug
persists.
This bug, and what are almost certainly duplicates (#21763, #22239), can
probably be fixed just by backing out or fixing the above commits.
I tried everything suggested in those bugs and I want to note that in my
case I have always had all my locale env vars set to C:
$ locale
LANG=C
LC_CTYPE="C"
...
LC_ALL=C
The locale doesn't change my results.
My results also do not change if I use -i or not (unlike #22239).
My results also do not change if I use -F or not.
I also want to mention that the commit clearly shows something is faulty
with the detection, which is probably causing contains_encoding_error() to
be true even though the file in my test, /usr/share/dict/words (like all the
(seq x) tests in this bug report), has no multibyte chars, only ASCII! How
can contains_encoding_error() possibly be true for a pure ASCII file? So
something with this whole commit's logic is just plain wrong.
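The pure-ASCII claim is easy to check independently of grep. A minimal sketch, assuming a POSIX tr and wc; the function name count_non_ascii_bytes is made up for illustration. It only inspects the bytes of the file and says nothing about what grep decides internally.

```shell
#!/bin/sh
# Hypothetical helper: print the number of bytes in FILE that fall
# outside the 7-bit ASCII range (0x00-0x7F). 0 means pure ASCII.
count_non_ascii_bytes() {
    # Delete every ASCII byte; whatever is left is non-ASCII.
    # LC_ALL=C makes tr treat the input as raw bytes.
    # The final tr strips padding that some wc implementations emit.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Running count_non_ascii_bytes /usr/share/dict/words should print 0 if the words file really is plain ASCII; the same check applies to the file being searched.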
I would strongly suggest everyone would be happier if grep, when it thinks
it has a file with an encoding it can't deal with, just errored out and
aborted, rather than switching to a mode that turns a run that used to take
2s (for the past 10+ years) into hours and RAM exhaustion! Then the user can
simply clean up the input files to make them compliant. Sounds reasonable
to me, as no program is required to deal with garbage input.
Or we need a switch that can disable this bogus mode switch.
--i-know-what-i-am-doing-stay-in-2s-mode-not-2h-mode
I am able to test patches rapidly if people want to throw ideas my way.
Until this gets fixed I'll just have to maintain my own binary rpm that
reverts that commit.
Thanks!