bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales

From: Paolo Bonzini
Subject: bug#17027: [PATCH] grep: prefer regex to DFA for ANYCHAR in non-UTF8 locales
Date: Tue, 01 Apr 2014 10:51:59 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

On 17/03/2014 16:01, Norihiro Tanaka wrote:
> Package: grep
> Tags: patch
>
> When a pattern contains ANYCHAR in a non-UTF8 locale, grep prefers the
> DFA engine to regex. However, as far as I have tested, even after
> applying Patch#17025, the DFA engine is slower than regex for ANYCHAR
> in non-UTF8 locales.
>
> This patch makes grep prefer regex to the DFA for ANYCHAR in non-UTF8
> locales.
>
> Create the test input:
>
> $ yes abcd.abc | head -1000000 > m
>
> Timing before applying the patch:
>
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.99
> user 1.75
> sys 0.28
>
> Timing after applying the patch:
>
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 1.21
> user 0.71
> sys 0.46
>
> Norihiro
>
Hi Norihiro,
what about something like this instead (untested)?
Paolo
diff --git a/src/dfa.c b/src/dfa.c
index c06c922..f756194 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -299,6 +299,7 @@ typedef struct
   position_set elems; /* Positions this state could match. */
   unsigned char context; /* Context from previous state. */
   char backref; /* True if this state matches a \<digit>. */
+  bool has_mbcset; /* True if this state matches a MBCSET. */
   unsigned short constraint; /* Constraint for this state to accept. */
   token first_end; /* Token value of the first END in elems. */
   position_set mbps; /* Positions which can match multibyte
@@ -2645,6 +2646,7 @@ dfastate (state_num s, struct dfa *d, state_num trans[])
           if (d->states[s].mbps.nelem == 0)
             alloc_position_set (&d->states[s].mbps, 1);
           insert (pos, &(d->states[s].mbps));
+          d->states[s].has_mbcset |= (d->tokens[pos.index] == MBCSET);
           continue;
         }
       else
@@ -3450,7 +3452,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
                  better performance (up to 25% better on [a-z], for
                  example) and enables support for collating symbols and
                  equivalence classes. */
-              if (backref)
+              if (d->states[s].has_mbcset && backref)
                 {
                   *backref = 1;
                   free (mblen_buf);
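
To make the idea in the diff above easier to follow outside dfa.c, here is a minimal standalone sketch. It is not code from grep: the `toy_` names, the three-value token enum and the reduced state struct are all hypothetical stand-ins, and it only illustrates the two things the hunks do, namely setting a per-state flag when an MBCSET position is recorded and handing over to regex only when that flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token kinds; the real dfa.c has many more.  */
enum toy_token { TOY_CHAR, TOY_ANYCHAR, TOY_MBCSET };

/* Hypothetical, much-reduced stand-in for a DFA state.  */
struct toy_state
{
  size_t n_mb_positions;  /* Number of ANYCHAR and MBCSET positions.  */
  bool has_mbcset;        /* True if any recorded position is an MBCSET.  */
};

/* Record one position in the state.  Like the dfastate hunk, the flag
   is updated only for positions that go into the multibyte set.  */
static void
toy_state_add (struct toy_state *st, enum toy_token tok)
{
  assert (st != NULL);
  if (tok == TOY_ANYCHAR || tok == TOY_MBCSET)
    {
      st->n_mb_positions++;
      st->has_mbcset |= (tok == TOY_MBCSET);
    }
}

/* Return true if the caller should stop using the DFA and fall back to
   regex.  Like the dfaexec hunk, this happens only when the state has an
   MBCSET and the caller passed a non-null BACKREF to be notified.  */
static bool
toy_should_fall_back (const struct toy_state *st, int *backref)
{
  assert (st != NULL);
  if (st->n_mb_positions == 0)
    return false;
  if (st->has_mbcset && backref)
    {
      *backref = 1;
      return true;
    }
  return false;
}
```

In this sketch, a state that contains only TOY_ANYCHAR positions stays on the DFA path, while a state that contains a TOY_MBCSET position sets `*backref` and asks the caller to fall back. Whether that split is the right trade-off for real inputs is exactly what would need to be measured, since the patch itself is untested.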