dfa fix for multibyte char inside square brackets

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

dfa fix for multibyte char inside square brackets

From:	Aharon Robbins
Subject:	dfa fix for multibyte char inside square brackets
Date:	Tue, 25 Jun 2013 08:54:01 +0300
User-agent:	Mutt/1.5.21 (2010-09-15)

Hi All.

A few weeks ago I reported a bug in dfa that was causing an assertion
failure in gawk. It occurred in UTF locales.  Attached are two test
programs, one that fails, and one that doesn't, and input.  The failure
is seen in gawk, I'm not sure how to reproduce in grep.

Mike Haertel was kind enough to track down the problem and fix it for
me. His email is below.  I have already applied his patch to the gawk
code.  I applied it to the dfa in grep 2.14 and grep passed its "make check"
but that is just FYI.

If you (plural) think it makes sense, please apply Mike's fix to
the grep dfa also, so that we can stay in sync.

Thanks!

Arnold

> To: address@hidden
> Subject: tentative fix for DFA bug, with explanation
> Date: Sun, 23 Jun 2013 12:24:03 -0700
> From: Mike Haertel <address@hidden>
> 
> Since this is 7-bit email, let FOO stand for the chinese character
> in the text regexp.  It turns out the following much simpler regexp:
>       ([^.]*[FOO]){1,2}
> is sufficient to cause the crash.
> 
> Some background info: in the first step of its parsing, DFA transforms
> regexp from human readable syntax into reverse-polish form.  For
> example, if the regexp is a|b (where a and b are arbitrary
> subexpressions), the RPN representation looks like:
> 
>       <RPN representation for a>
>       <RPN representation for b>
>       OR
> 
> and if the regexp is ab, the RPN representation is
> 
>       <RPN representation for a>
>       <RPN representation for b>
>       CAT
> 
> For regexps of the form a{m,n} repeat counts, it simply builds
> repeated copies of the representation of a, with appropriate inserted
> CAT and QMARK operators.  For the above example with a regexp of
> the form a{1,2} it would build:
> 
>       <RPN representation for a>
>       <RPN representation for a>
>       QMARK
>       CAT
> 
> When building repeated copies of RPN representations, additional
> copies of the RPN representations are made by calling a function
> copytoks() with arguments consisting of the start position and
> length of the original copy.
> 
> The problem is that the current code for copytoks() is simply
> incorrect.  It operates by calling addtok() for each individual
> token in the source range being copied.  But, in the particular
> case that the token being added is MBCSET, addtok():
> 
> (1) incorrectly assumes that the character set being added to be added
>     is the one most (addtok has no argument to indicate which cset is
>     being added, so it just uses the latest one)
> 
> (2) attempts to do some token sequence expansion into more primitive
>     operators so things like [FOO] are matched efficiently.
> 
> Both of these assumptions are incorrect in the case that addtok()
> is being called from copytoks(): (1) is simply not true, and
> (2) is redundant--the expansion has already been done token sequence
> being copied, so there is no need to do the expansion again.
> 
> Based on my reading of the code, it looks like the correct function to
> add exactly one token, without further expansion, is addtok_mb().
> 
> So here is my proposed fix, which is that copytoks() should never call
> addtok(), but instead directly call addtok_mb() (which is what addtok()
> eventually calls).
> 
> diff --git a/dfa.c b/dfa.c
> index 54e0ae9..2195e28 100644
> --- a/dfa.c
> +++ b/dfa.c
> @@ -1847,13 +1847,12 @@ copytoks (size_t tindex, size_t ntokens)
>  {
>    size_t i;
>  
> -  for (i = 0; i < ntokens; ++i)
> -    {
> -      addtok (dfa->tokens[tindex + i]);
> -      /* Update index into multibyte csets.  */
> -      if (MB_CUR_MAX > 1 && dfa->tokens[tindex + i] == MBCSET)
> -        dfa->multibyte_prop[dfa->tindex - 1] = dfa->multibyte_prop[tindex + 
> i];
> -    }
> +  if (MB_CUR_MAX > 1)
> +    for (i = 0; i < ntokens; ++i)
> +      addtok_mb(dfa->tokens[tindex + i], dfa->multibyte_prop[tindex + i]);
> +  else
> +    for (i = 0; i < ntokens; ++i)
> +      addtok_mb(dfa->tokens[tindex + i], 3);
>  }
>  
>  static void
>  
> With this fix, both tp1.awk and tp2.awk pass.  I haven't tested it on anything
> else, since I have no idea how to run your validation suite (if any).
> 
> So please try this out on your test suite and let me know how it works.
-- 
Aharon (Arnold) Robbins                         arnold AT skeeve DOT com
P.O. Box 354            Home Phone: +972 8 979-0381
Nof Ayalon
D.N. Shimshon 9978500   ISRAEL

tp1.awk
Description: Text document

tp2.awk
Description: dfa_test.in

[Prev in Thread]

Current Thread

[Next in Thread]

dfa fix for multibyte char inside square brackets, Aharon Robbins <=

Prev by Date: Re: [bug #39334] obstack code is non portable
Next by Date: Re: [bug #39334] obstack code is non portable
Previous by thread: [bug #39334] obstack code is non portable
Index(es):
- Date
- Thread