[bug #16305] grep much less efficient when matching multiple patterns th

bug-grep

From:	Levi Waldron
Subject:	[bug #16305] grep much less efficient when matching multiple patterns than when matching each pattern sequentially
Date:	Sat, 8 Apr 2006 20:57:19 +0000
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20051010 Firefox/1.0.7 (Ubuntu package 1.0.7)

URL:
  <http://savannah.gnu.org/bugs/?func=detailitem&item_id=16305>

                 Summary: grep much less efficient when matching multiple patterns than when matching each pattern sequentially
                 Project: grep
            Submitted by: lwaldron
            Submitted on: Saturday 04/08/06 at 20:57
                Category: None
                Severity: 3 - Normal
              Item Group: None
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open

    _______________________________________________________

Details:

I have a list of patterns which is about 4,000 lines like this:
(patterns.txt):

PTPAFFX.131946.1.S1_S_AT
PTPAFFX.11573.1.A1_AT
PTPAFFX.209184.1.S1_AT
PTP.3766.1.S1_AT
PTP.3804.1.S1_AT

And a data file of about 7,000 lines (roughly 1MB in total) like this:
(data.txt)

AFFX-BIOB-3_AT 1429.6 2545.4 816.966666666667 1646.9 1698.96666666667 2819.06666666667 1085.33333333333 1915.26666666667 0.99999721095669 0.999997210956687
PTPAFFX.126566.1.S1_AT 2442.5 2636.96666666667 2341.06666666667 2244.76666666667 2604.96666666667 2997.93333333333 2399.96666666667 2207.4 0.999995178917582 0.999995178917537
PTPAFFX.212425.1.S1_AT 496.366666666667 551 430.433333333333 482.466666666667 517.6 642.966666666667 371.533333333333 487.766666666667 0.99989956995976 0.999899569959758

(each line in data.txt starts with an identifier string that may match a
pattern in patterns.txt)

Every pattern in patterns.txt has a match somewhere in data.txt.  When I run
this search like this:

grep --file=patterns.txt data.txt > matches.txt

it consumes an *extreme* amount of memory and CPU.  On my 2GHz Celeron with
512MB RAM it uses almost all of the 1GB swap space and would probably take
12 hours if I were to let it finish.  I also had the opportunity to run it
on a large Beowulf cluster
(http://www.botany.utoronto.ca/bbc_access/Botany_Beowulf_Cluster.htm), and
after 3 minutes this method still hadn't found a single match.
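
For anyone who wants to try this without the original files, the following
is a rough sketch that builds stand-in inputs of similar size.  The ID
format, the number of columns and the random values are assumptions that
only approximate the samples above; it overwrites data.txt and patterns.txt
in the current directory.

```shell
#!/bin/sh
# Hypothetical generator: the IDs and values only approximate the report's data.
# Write 7,000 data lines: an identifier followed by ten numeric columns.
awk 'BEGIN {
    srand(1)
    for (i = 1; i <= 7000; i++) {
        printf "PTPAFFX.%d.1.S1_AT", 100000 + i
        for (j = 1; j <= 10; j++) printf " %.12f", rand() * 3000
        printf "\n"
    }
}' > data.txt

# Use the identifier column of the first 4,000 lines as the pattern list,
# so that every pattern has a match somewhere in data.txt.
head -n 4000 data.txt | cut -d ' ' -f 1 > patterns.txt
```

The grep command shown above can then be run against these files as is.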

If I instead run the search one pattern at a time like:

for line in `cat patterns.txt`;do grep $line data.txt >> matches.txt;done

it uses a small amount of memory and completes all 4,000 matches on my home
computer in maybe 15-30 minutes and on the Beowulf cluster in less than 30
seconds.
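
The two approaches do not write their matches in the same order: the loop
emits lines in pattern order, while a single grep emits them in data order.
The sketch below is one way to check that both select the same set of lines.
The file names and the shortened rows are illustrative stand-ins, and the
loop is written with read -r and quoting, which is a more defensive variant
rather than the exact command above.

```shell
#!/bin/sh
# Illustrative stand-in files; the names and shortened rows are assumptions.
printf '%s\n' 'PTP.3766.1.S1_AT' 'PTP.3804.1.S1_AT' > loop_patterns.txt
printf '%s\n' \
    'PTP.3804.1.S1_AT 1.0 2.0' \
    'AFFX-BIOB-3_AT 3.0 4.0' \
    'PTP.3766.1.S1_AT 5.0 6.0' > loop_data.txt

# One grep per pattern; read -r and the quotes keep each pattern line intact.
: > loop_matches.txt
while IFS= read -r pattern; do
    grep -e "$pattern" loop_data.txt >> loop_matches.txt
done < loop_patterns.txt

# A single grep over all patterns, for comparison.
grep -f loop_patterns.txt loop_data.txt > single_matches.txt

# Sort and de-duplicate both outputs before comparing, since the order
# differs and the loop can repeat a line that matches several patterns.
sort -u loop_matches.txt > loop_sorted.txt
sort -u single_matches.txt > single_sorted.txt
cmp -s loop_sorted.txt single_sorted.txt && echo "same set of matching lines"
```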

Perhaps GNU grep would generally run faster with numerous patterns if it
searched for them one at a time?  Also, why is this job so CPU and memory
intensive?
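
One detail that may be relevant: the patterns above are literal identifiers,
but under grep's default matching each "." in them is a regular-expression
metacharacter that matches any character.  grep's documented -F
(--fixed-strings) option treats every pattern as a literal string instead.
The sketch below only illustrates the difference in matching on tiny
stand-in files (the file names and the last data row are made up); whether
-F avoids the slowdown described here is not something this report
establishes.

```shell
#!/bin/sh
# Tiny stand-in files; the names and the shortened rows are illustrative only.
cat > sample_patterns.txt <<'EOF'
PTPAFFX.126566.1.S1_AT
PTPAFFX.212425.1.S1_AT
EOF

cat > sample_data.txt <<'EOF'
AFFX-BIOB-3_AT 1429.6 2545.4
PTPAFFX.126566.1.S1_AT 2442.5 2636.96666666667
PTPAFFX.212425.1.S1_AT 496.366666666667 551
PTPAFFXx126566.1.S1_AT 1.0 2.0
EOF

# Default (basic regex) matching: "." matches any character,
# so the made-up fourth row also matches the first pattern.
grep -f sample_patterns.txt sample_data.txt > regex_matches.txt

# Fixed-string matching: patterns are compared literally,
# so only the second and third rows match.
grep -F -f sample_patterns.txt sample_data.txt > fixed_matches.txt
```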

Both the cluster and my personal computer are running GNU grep 2.5.1.



    _______________________________________________________

Carbon-Copy List:

CC Address                          | Comment
------------------------------------+-----------------------------
lwaldron                            | 




    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?func=detailitem&item_id=16305>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
