bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: UTF-16 surrogate pair handling in grep -i option


From: Jim Meyering
Subject: Re: UTF-16 surrogate pair handling in grep -i option
Date: Sun, 25 Aug 2013 12:49:20 -0700

On Mon, Aug 19, 2013 at 5:43 AM, Corinna Vinschen <address@hidden> wrote:
> Hi Jim,
>
> On Aug 18 14:26, Jim Meyering wrote:
>> Thank you for adjusting that.  I've added a test script and tweaked a
>> few minor things.
>>
>> I added curly braces around another 1-line else statement, adjusted the
>> commit log to use the explanation from NEWS and added our usual
>> ChangeLog-style entries.
>>
>> In NEWS, I've added the statement that the bug was introduced in 2.6,
>> but that is only a guess.  Can you tell me if that is true?
>
> Well, it depends.  In 2.6, the then-new mbtolower function wouldn't have
> done the right thing with surrogates, but it wouldn't have SEGV'ed,
> either.  The SEGV is a result of introducing the call to
>
>   memset (m + 1, 0, ombclen - 1);
>
> Since ombclen is 0 after encountering the high surrogate, ombclen - 1
> is -1, which, as a size_t value, tries to memset a lot of memory...
>
> This memset call has been introduced with grep 2.13, so this is where
> the problem started.

Thanks.  I've adjusted NEWS to mention that.

> Thanks for writing the testcase.  As you noted above, it fails in
> some cases and works in others, namely, the -F and -iF cases work,
> the other cases don't, since the regex compiler mishandles surrogates.
>
> But, here's a question:  If the surrogate-pair test fails without the
> patch due to the SEGV, and it also fails with the patch, just in a
> different way, what's the idea of the testcase?  In theory, shouldn't
> there be two tests, one of them testing only for this very SEGV, and
> another test testing how grep handles 4 byte UTF-8 values, since that's
> another problem entirely?

It's a trade-off.  Split surrogate-pair testing into two very similar
test scripts?
Factor the similar parts into cfg.sh and use them from two test scripts?
It didn't fee like it was justified in this case, since it's a
cygwin-specific bug.

If there's a short/reliable shell-level test for "is-cygwin", I suppose we
could make the loop that iterates over grep options skip the currently-
known-to-fail cases on Cygwin systems.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]