[PATCH 1/2] dfa: fix multibyte character in brackets with repetition

From: Paolo Bonzini
Subject: [PATCH 1/2] dfa: fix multibyte character in brackets with repetition
Date: Fri, 5 Jul 2013 13:28:23 +0200

From: Mike Haertel <address@hidden>

Let FOO stand for any multibyte character (e.g. a CJK character) in the regexp.
It turns out the following much simpler regexp:
is sufficient to cause the crash.

In the first step of parsing, DFA transforms the regexp from
human-readable syntax into reverse Polish notation (RPN).  For regexps
with repeat counts of the form a{m,n}, it simply builds repeated copies
of the representation of a, with CAT and QMARK operators inserted as
appropriate.  For the above example, a regexp of the form a{1,2}, it
would build:

        <RPN representation for a>
        <RPN representation for a>
        QMARK
        CAT

When building repeated copies of RPN representations, additional
copies of the RPN representations are made by calling a function
copytoks() with arguments consisting of the start position and
length of the original copy.
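
To make the expansion concrete, here is a minimal sketch of how a{m,n}
can be unrolled in RPN by copying the operand's token range.  This is
not the dfa.c code: the token kinds, struct rpn, push(), copy_range()
and expand_repeat() are all hypothetical names invented for this
illustration, and the sketch assumes 1 <= m <= n.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds; the real token set is much larger.  */
enum tok { TOK_A, TOK_CAT, TOK_QMARK };

enum { MAX_TOKENS = 64 };

/* Toy fixed-size RPN buffer (an assumption of this sketch).  */
struct rpn
{
  enum tok tokens[MAX_TOKENS];
  size_t ntokens;
};

/* Append one token to the end of the buffer.  */
static void
push (struct rpn *r, enum tok t)
{
  assert (r->ntokens < MAX_TOKENS);
  r->tokens[r->ntokens++] = t;
}

/* Append a copy of the LEN tokens starting at START.  */
static void
copy_range (struct rpn *r, size_t start, size_t len)
{
  size_t i;
  for (i = 0; i < len; ++i)
    push (r, r->tokens[start + i]);
}

/* Turn the operand occupying [START, START + LEN) into operand{MIN,MAX},
   assuming 1 <= MIN <= MAX.  */
static void
expand_repeat (struct rpn *r, size_t start, size_t len, int min, int max)
{
  int i;

  /* Mandatory repetitions: each adds a copy followed by CAT.  */
  for (i = 1; i < min; ++i)
    {
      copy_range (r, start, len);
      push (r, TOK_CAT);
    }

  /* Optional repetitions: each adds a copy followed by QMARK CAT.  */
  for (; i < max; ++i)
    {
      copy_range (r, start, len);
      push (r, TOK_QMARK);
      push (r, TOK_CAT);
    }
}
```

Starting from a buffer holding the single token a, the call
expand_repeat (&r, 0, 1, 1, 2) leaves it holding a, a, QMARK, CAT.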

The problem is that the current code for copytoks() is simply
incorrect.  It operates by calling addtok() for each individual
token in the source range being copied.  But, in the particular
case that the token being added is MBCSET, addtok():

(1) incorrectly assumes that the character set being added is the
    most recently parsed one (addtok has no argument to indicate which
    cset is being added, so it just uses the latest one)

(2) attempts to do some token sequence expansion into more primitive
    operators so things like [FOO] are matched efficiently.

Both of these behaviors are wrong when addtok() is called from
copytoks(): the assumption in (1) is simply not true, and (2) is
redundant, because the expansion has already been done in the token
sequence being copied, so there is no need to do it again.
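
The two failure modes can be reproduced in a toy model.  The sketch
below is only an illustration under stated assumptions: struct toks,
append_raw(), add_expanding() and copy_buggy() are hypothetical names,
and the "SET CHAR OR" expansion is invented here and is not the
expansion dfa.c performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token kinds, far fewer than a real matcher has.  */
enum tok { TOK_CHAR, TOK_SET, TOK_OR };

enum { MAX_TOKENS = 64 };

struct toks
{
  enum tok tok[MAX_TOKENS];
  int set_id[MAX_TOKENS];   /* Set a TOK_SET refers to, else -1.  */
  size_t n;
  int latest_set;           /* Most recently parsed set.  */
};

/* Append exactly one token, with no interpretation.  */
static void
append_raw (struct toks *t, enum tok k, int set_id)
{
  assert (t->n < MAX_TOKENS);
  t->tok[t->n] = k;
  t->set_id[t->n] = set_id;
  t->n++;
}

/* Parser-facing add.  A TOK_SET is taken to mean the latest set and is
   expanded into "SET CHAR OR" (an invented expansion).  */
static void
add_expanding (struct toks *t, enum tok k)
{
  if (k == TOK_SET)
    {
      append_raw (t, TOK_SET, t->latest_set);
      append_raw (t, TOK_CHAR, -1);
      append_raw (t, TOK_OR, -1);
    }
  else
    append_raw (t, k, -1);
}

/* Buggy copy: feeds already expanded tokens through add_expanding.  */
static void
copy_buggy (struct toks *t, size_t start, size_t len)
{
  size_t i;
  for (i = 0; i < len; ++i)
    add_expanding (t, t->tok[start + i]);
}
```

If set 0 and then set 1 are added, copying the three tokens of set 0
with copy_buggy() appends five tokens instead of three, and the copied
TOK_SET points at set 1 rather than set 0.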

The correct function to add exactly one token, without further expansion,
is addtok_mb().  So my proposed fix is that copytoks() should never call
addtok(), but should instead call addtok_mb() directly (which is what
addtok() eventually calls).
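
The idea behind the fix can be sketched outside dfa.c as follows.
This is a toy model, not the patched function: struct toks,
append_raw() and copy_fixed() are hypothetical names, and the prop
array merely stands in for the per-token data that the patch reads
from dfa->multibyte_prop.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_TOKENS = 64 };

/* Toy token buffer in which every token carries its own property.  */
struct toks
{
  int tok[MAX_TOKENS];
  int prop[MAX_TOKENS];
  size_t n;
};

/* Append exactly one token and its property, with no expansion.  */
static void
append_raw (struct toks *t, int tok, int prop)
{
  assert (t->n < MAX_TOKENS);
  t->tok[t->n] = tok;
  t->prop[t->n] = prop;
  t->n++;
}

/* Copy LEN tokens starting at START.  Each copy takes its property
   from the source position, never from "the latest" state.  */
static void
copy_fixed (struct toks *t, size_t start, size_t len)
{
  size_t i;
  for (i = 0; i < len; ++i)
    append_raw (t, t->tok[start + i], t->prop[start + i]);
}
```

Because each token is appended verbatim along with the property stored
at its source index, the copy is always exactly LEN tokens long and is
identical to the range it was copied from.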

* src/dfa.c (copytoks): Rewrite using addtok_mb directly.
 src/dfa.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index fe08f34..abf620d 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1794,13 +1794,12 @@ copytoks (size_t tindex, size_t ntokens)
 {
   size_t i;
 
-  for (i = 0; i < ntokens; ++i)
-    {
-      addtok (dfa->tokens[tindex + i]);
-      /* Update index into multibyte csets.  */
-      if (MB_CUR_MAX > 1 && dfa->tokens[tindex + i] == MBCSET)
-        dfa->multibyte_prop[dfa->tindex - 1] = dfa->multibyte_prop[tindex + i];
-    }
+  if (MB_CUR_MAX > 1)
+    for (i = 0; i < ntokens; ++i)
+      addtok_mb(dfa->tokens[tindex + i], dfa->multibyte_prop[tindex + i]);
+  else
+    for (i = 0; i < ntokens; ++i)
+      addtok_mb(dfa->tokens[tindex + i], 3);
 }
 
 static void
