qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000


From: Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8")
Date: Tue, 14 Aug 2018 08:07:34 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Eric Blake <address@hidden> writes:

> On 08/13/2018 02:00 AM, Markus Armbruster wrote:
>> Eric Blake <address@hidden> writes:
>>
>>> On 08/10/2018 10:48 AM, Eric Blake wrote:
>>>> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
>>>>> This is consistent with qobject_to_json().  See commit e2ec3f97680.
>>>>
>>>> Side note: that commit mentions that on output, ASCII DEL (0x7f) is
>>>> always escaped. RFC 7159 does not require it to be escaped on input,
>>
>> Weird, isn't it?
>>
>>>> but I wonder if any of your earlier testsuite improvements should
>>>> specifically cover \x7f vs. \u007f on input being canonicalized to
>>>> \u007f on round trip output.
>>
>>>From utf8_string():
>>
>>          /* 2.2.1  1 byte U+007F */
>>          {
>>              "\x7F",
>>              "\x7F",
>>              "\\u007F",
>>          },
>>
>> We test parsing of JSON "\x7F" (expecting C string "\x7F"), unparsing of
>> that C string (expecting JSON "\\u007F"), and after PATCH 29 parsing of
>> that JSON (expecting the C string again).  Sufficient?
>
> Indeed, looks like we have coverage of DEL thanks to the bounds
> testing of various interesting UTF-8 inflection points.
>
>>>>> +++ b/qobject/json-lexer.c
>>>>> @@ -93,7 +93,7 @@
>>>>>     *   interpolation = %((l|ll|I64)[du]|[ipsf])
>>>>>     *
>>>>>     * Note:
>>>>> - * - Input must be encoded in UTF-8.
>>>>> + * - Input must be encoded in modified UTF-8.
>>>>
>>>> Worth documenting this in the QMP doc as an explicit extension?
>>
>> qmp-spec.txt:
>>
>>      The sever expects its input to be encoded in UTF-8, and sends its
>>      output encoded in ASCII.
>>
>> The obvious update would be to stick in "modified".
>>
>>>>                                                                  In
>>>> general, our QMP interfaces that take binary input do so via base64
>>>> encoding, rather than via a modified UTF-8 string -
>>
>> Agreed.
>>
>> However, whether QMP has a use for funny characters or not, the JSON
>> parser has to handle them *somehow*.  "Handle" in the broadest possible
>> sense, including "reject".  Not including misbehavior like "crash" and
>> "silently ignore some input following ASCII NUL".
>>
>
>>
>>>                                                      So it's really
>>> only a question of whether our input engine can pass "\x00"
>>> vs. "\\u0000" when we NEED an input NUL, and except for the testsuite,
>>> our QAPI schema never really needs an input NUL.
>>
>> The question is how the JSON parser is to handle "\u0000" escapes in
>> JSON strings and NUL bytes anywhere.
>>
>> The answer for NUL bytes is obvious: reject them just like any other
>> byte <= 0x1F, as required by the RFC.
>
> Yes, rejecting \x00 on input is fine, which leaves only the escape
> inside strings:
>
>>
>> My answer for \u0000 is to treat it as much like any other codepoint <=
>> 0x1F as possible.  Treating it exactly like them isn't possible, because
>> NUL bytes in C strings aren't possible.  However, \xC0\x80 sequences
>> are.
>
> Yes, for internal processing to use modified UTF-8 so that we can
> accept "\\u0000" on input is reasonable. And even nicer if we turn a
> QString containing \xc0\x80 back into "\\u0000" on output, so that our
> use of modified UTF-8 never even escapes to the QMP client (and then
> we don't need a documentation change).

We do, in to_json():

        for (; *ptr; ptr = end) {
--->        cp = mod_utf8_codepoint(ptr, 6, &end);
            switch (cp) {
            [special cases like \n...]
            default:
                if (cp < 0) {
                    cp = 0xFFFD; /* replacement character */
                }
                if (cp > 0xFFFF) {
                    /* beyond BMP; need a surrogate pair */
                    snprintf(buf, sizeof(buf), "\\u%04X\\u%04X",
                             0xD800 + ((cp - 0x10000) >> 10),
                             0xDC00 + ((cp - 0x10000) & 0x3FF));
                } else if (cp < 0x20 || cp >= 0x7F) {
--->                snprintf(buf, sizeof(buf), "\\u%04X", cp);
                } else {
                    buf[0] = cp;
                    buf[1] = 0;
                }
                qstring_append(str, buf);
            }

mod_utf8_codepoint() recognizes \xC0\x80 as codepoint 0.  That's an
ASCII control character, and those gets emitted using format \\u%04X.

>>
>> Reasonably simple and consistent, don't you think?
>>



reply via email to

[Prev in Thread] Current Thread [Next in Thread]