[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000
From: |
Markus Armbruster |
Subject: |
Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8") |
Date: |
Tue, 14 Aug 2018 08:07:34 +0200 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) |
Eric Blake <address@hidden> writes:
> On 08/13/2018 02:00 AM, Markus Armbruster wrote:
>> Eric Blake <address@hidden> writes:
>>
>>> On 08/10/2018 10:48 AM, Eric Blake wrote:
>>>> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
>>>>> This is consistent with qobject_to_json(). See commit e2ec3f97680.
>>>>
>>>> Side note: that commit mentions that on output, ASCII DEL (0x7f) is
>>>> always escaped. RFC 7159 does not require it to be escaped on input,
>>
>> Weird, isn't it?
>>
>>>> but I wonder if any of your earlier testsuite improvements should
>>>> specifically cover \x7f vs. \u007f on input being canonicalized to
>>>> \u007f on round trip output.
>>
>>>From utf8_string():
>>
>> /* 2.2.1 1 byte U+007F */
>> {
>> "\x7F",
>> "\x7F",
>> "\\u007F",
>> },
>>
>> We test parsing of JSON "\x7F" (expecting C string "\x7F"), unparsing of
>> that C string (expecting JSON "\\u007F"), and after PATCH 29 parsing of
>> that JSON (expecting the C string again). Sufficient?
>
> Indeed, looks like we have coverage of DEL thanks to the bounds
> testing of various interesting UTF-8 inflection points.
>
>>>>> +++ b/qobject/json-lexer.c
>>>>> @@ -93,7 +93,7 @@
>>>>> * interpolation = %((l|ll|I64)[du]|[ipsf])
>>>>> *
>>>>> * Note:
>>>>> - * - Input must be encoded in UTF-8.
>>>>> + * - Input must be encoded in modified UTF-8.
>>>>
>>>> Worth documenting this in the QMP doc as an explicit extension?
>>
>> qmp-spec.txt:
>>
>> The sever expects its input to be encoded in UTF-8, and sends its
>> output encoded in ASCII.
>>
>> The obvious update would be to stick in "modified".
>>
>>>> In
>>>> general, our QMP interfaces that take binary input do so via base64
>>>> encoding, rather than via a modified UTF-8 string -
>>
>> Agreed.
>>
>> However, whether QMP has a use for funny characters or not, the JSON
>> parser has to handle them *somehow*. "Handle" in the broadest possible
>> sense, including "reject". Not including misbehavior like "crash" and
>> "silently ignore some input following ASCII NUL".
>>
>
>>
>>> So it's really
>>> only a question of whether our input engine can pass "\x00"
>>> vs. "\\u0000" when we NEED an input NUL, and except for the testsuite,
>>> our QAPI schema never really needs an input NUL.
>>
>> The question is how the JSON parser is to handle "\u0000" escapes in
>> JSON strings and NUL bytes anywhere.
>>
>> The answer for NUL bytes is obvious: reject them just like any other
>> byte <= 0x1F, as required by the RFC.
>
> Yes, rejecting \x00 on input is fine, which leaves only the escape
> inside strings:
>
>>
>> My answer for \u0000 is to treat it as much like any other codepoint <=
>> 0x1F as possible. Treating it exactly like them isn't possible, because
>> NUL bytes in C strings aren't possible. However, \xC0\x80 sequences
>> are.
>
> Yes, for internal processing to use modified UTF-8 so that we can
> accept "\\u0000" on input is reasonable. And even nicer if we turn a
> QString containing \xc0\x80 back into "\\u0000" on output, so that our
> use of modified UTF-8 never even escapes to the QMP client (and then
> we don't need a documentation change).
We do, in to_json():
for (; *ptr; ptr = end) {
---> cp = mod_utf8_codepoint(ptr, 6, &end);
switch (cp) {
[special cases like \n...]
default:
if (cp < 0) {
cp = 0xFFFD; /* replacement character */
}
if (cp > 0xFFFF) {
/* beyond BMP; need a surrogate pair */
snprintf(buf, sizeof(buf), "\\u%04X\\u%04X",
0xD800 + ((cp - 0x10000) >> 10),
0xDC00 + ((cp - 0x10000) & 0x3FF));
} else if (cp < 0x20 || cp >= 0x7F) {
---> snprintf(buf, sizeof(buf), "\\u%04X", cp);
} else {
buf[0] = cp;
buf[1] = 0;
}
qstring_append(str, buf);
}
mod_utf8_codepoint() recognizes \xC0\x80 as codepoint 0. That's an
ASCII control character, and those gets emitted using format \\u%04X.
>>
>> Reasonably simple and consistent, don't you think?
>>
[Qemu-devel] [PATCH 28/56] json: Fix \uXXXX for surrogate pairs, Markus Armbruster, 2018/08/08
[Qemu-devel] [PATCH 44/56] json: Fix latent parser aborts at end of input, Markus Armbruster, 2018/08/08
[Qemu-devel] [PATCH 20/56] check-qjson: Document we expect invalid UTF-8 to be rejected, Markus Armbruster, 2018/08/08