bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 loc


From: Paolo Bonzini
Subject: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales
Date: Tue, 01 Apr 2014 10:51:59 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

On 17/03/2014 16:01, Norihiro Tanaka wrote:
> Package: grep
> Tags: patch
> 
> When a pattern contains ANYCHAR in a non-UTF8 locale, grep prefers the
> DFA engine to regex's.  However, as far as I have tested, even after
> applying Patch#17025, the regex engine is slower than DFA's for ANYCHAR
> in non-UTF8 locales.
> 
> This patch prefers regex to DFA for ANYCHAR in non-UTF8 locales.
> 
> Create the test file:
> 
> $ yes abcd.abc | head -1000000 > m
> 
> Before applying the patch, I timed the following:
> 
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.99
> user 1.75
> sys 0.28
> 
> After applying it, I timed the same command again:
> 
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.21
> user 0.71
> sys 0.46
> 
> Norihiro
> 

Hi Norihiro,

what about something like this instead (untested)?

Paolo

diff --git a/src/dfa.c b/src/dfa.c
index c06c922..f756194 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -299,6 +299,7 @@ typedef struct
   position_set elems;           /* Positions this state could match.  */
   unsigned char context;        /* Context from previous state.  */
   char backref;                 /* True if this state matches a \<digit>.  */
+  bool has_mbcset;              /* True if this state matches a MBCSET.  */
   unsigned short constraint;    /* Constraint for this state to accept.  */
   token first_end;              /* Token value of the first END in elems.  */
   position_set mbps;            /* Positions which can match multibyte
@@ -2645,6 +2646,7 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
           if (d->states[s].mbps.nelem == 0)
             alloc_position_set (&d->states[s].mbps, 1);
           insert (pos, &(d->states[s].mbps));
+          d->states[s].has_mbcset |= (d->tokens[pos.index] == MBCSET);
           continue;
         }
       else
@@ -3450,7 +3452,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
                  better performance (up to 25% better on [a-z], for
                  example) and enables support for collating symbols and
                  equivalence classes.  */
-              if (backref)
+              if (d->states[s].has_mbcset && backref)
                 {
                   *backref = 1;
                   free (mblen_buf);
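
To make the intent of the diff easier to follow, here is a minimal standalone sketch of the same idea. It is not the real dfa.c code: every toy_* name, the three-token enum, and the fixed-size position array are hypothetical simplifications. It only mirrors the two changed lines: a per-state flag is OR-ed in whenever a multibyte position's token is an MBCSET, and the matcher gives up in favour of regex only when that flag is set and the caller passed a non-null backref pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token kinds; the real dfa.c defines many more.  */
typedef enum { TOK_CHAR, TOK_ANYCHAR, TOK_MBCSET } toy_token;

enum { TOY_MAX_POS = 8 };

/* Simplified stand-in for a DFA state: only the multibyte positions
   and the new flag are modelled.  */
typedef struct
{
  size_t mb_pos[TOY_MAX_POS];   /* Token indexes of multibyte positions.  */
  size_t n_mb_pos;              /* Number of entries used in mb_pos.  */
  bool has_mbcset;              /* True if any of them is an MBCSET.  */
} toy_state;

/* Record a multibyte position in S and update the flag, in the same
   way as the "|=" line added to dfastate in the patch.  */
static void
toy_add_mb_position (toy_state *s, toy_token const *tokens, size_t index)
{
  assert (s->n_mb_pos < TOY_MAX_POS);
  s->mb_pos[s->n_mb_pos++] = index;
  s->has_mbcset |= (tokens[index] == TOK_MBCSET);
}

/* Return true if matching should stop and be handed to regex.
   BACKREF mirrors the optional out-parameter tested in the patched
   condition: when it is null, nothing is reported and the function
   returns false.  */
static bool
toy_should_fall_back (toy_state const *s, int *backref)
{
  if (s->has_mbcset && backref)
    {
      *backref = 1;
      return true;
    }
  return false;
}
```

In this sketch, a state that holds only an ANYCHAR position keeps has_mbcset false, so toy_should_fall_back leaves *backref untouched; once an MBCSET position is added, the same call sets *backref to 1.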





