[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#17376: [PATCH] grep: fix the different behaviour for a invalid seque
From: |
Norihiro Tanaka |
Subject: |
bug#17376: [PATCH] grep: fix the different behaviour for a invalid sequence between KWset and DFA |
Date: |
Thu, 01 May 2014 07:16:31 +0900 |
Sorry, tow test cases are wrong. It's as follows surely.
encode() { echo "$1" | tr ABC '\357\274\241'; }
encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
encode ABC | env LC_ALL=en_US.utf8 src/grep -F "$(encode A)"
encode aABC | env LC_ALL=en_US.utf8 src/grep "a$(encode A)\|q"
^
encode aABC | env LC_ALL=en_US.utf8 src/grep -F "a$(encode A)"
^
We will expect none of the commands output anything, but we get 1 row in
4th. We need to fix last line in searchutils.c (is_mb_middle) to make
it correctly.
return 0 < match_len && match_len < mbrlen (p, end - p, &cur_state);
We must check whether it's valid at not the top but a part of last of
matched pattern. Now, although checked here: `a$(encode A)', correctly
should be checked here: `a$(encode A)'. ^
^^^^^^^^^^^
However, it may cause slowdown in some typical cases which doesn't include
any invalid sequences, and many users won't hope it.
Further more, I seem that DFA doesn't treat invalid sequence correctly.
I checked it with debugging on. No longer tokens are broken in 1st case.
encode ABC | env LC_ALL=en_US.utf8 src/grep "$(encode A)\|q"
dfaanalyze:
0:c3 1:af 2:CAT 3:71 4:OR 5:END 6:CAT
I expect below, becuase `encode ABC' is `ef bc a1'.
dfaanalyze:
0:ef 1:71 2:OR 3:END 4:CAT
However, It will be also difficult to fix it correctly. Therefore,
I propose the simple fix in the patch.
Thanks,
Norihiro