bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#17376: [PATCH] grep: fix the different behaviour for a invalid seque


From: Norihiro Tanaka
Subject: bug#17376: [PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA
Date: Thu, 01 May 2014 07:16:31 +0900

Sorry, tow test cases are wrong.  It's as follows surely.

  encode() { echo "$1" | tr ABC '\357\274\241'; }
  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
  encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
  encode aABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
         ^
  encode aABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
         ^

We will expect none of the commands output anything, but we get 1 row in
4th.  We need to fix last line in searchutils.c (is_mb_middle) to make
it correctly.

  return 0 < match_len && match_len < mbrlen (p, end - p, &cur_state);

We must check whether it's valid at not the top but a part of last of
matched pattern.  Now, although checked here: `a$(encode A)', correctly
should be checked here: `a$(encode A)'.        ^
                          ^^^^^^^^^^^
However, it may cause slowdown in some typical cases which doesn't include
any invalid sequences, and many users won't hope it.

Further more, I seem that DFA doesn't treat invalid sequence correctly.
I checked it with debugging on.  No longer tokens are broken in 1st case.

  encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"

  dfaanalyze:
   0:c3 1:af 2:CAT 3:71 4:OR 5:END 6:CAT

I expect below, becuase `encode ABC' is `ef bc a1'.

  dfaanalyze:
    0:ef 1:71 2:OR 3:END 4:CAT

However, It will be also difficult to fix it correctly.  Therefore,
I propose the simple fix in the patch.

Thanks,
Norihiro






reply via email to

[Prev in Thread] Current Thread [Next in Thread]