[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH] check-qjson: More thorough testing of UTF-8 in
From: |
Blue Swirl |
Subject: |
Re: [Qemu-devel] [PATCH] check-qjson: More thorough testing of UTF-8 in strings |
Date: |
Thu, 28 Feb 2013 18:28:05 +0000 |
On Thu, Feb 28, 2013 at 8:14 AM, Markus Armbruster <address@hidden> wrote:
> Blue Swirl <address@hidden> writes:
>
>> On Mon, Feb 4, 2013 at 5:19 PM, Markus Armbruster <address@hidden> wrote:
>>> Test cases are scraped from Markus Kuhn's UTF-8 decoder capability and
>>> stress test at
>>> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>>
>>> Unfortunately, both JSON parser and formatter misbehave right now.
>>> This test expects current, incorrect results. They're all clearly
>>> marked, and are to be replaced by correct ones as the bugs get fixed.
>>> See comments in new utf8_string() for details.
>>>
>>> Signed-off-by: Markus Armbruster <address@hidden>
>>> ---
>>> tests/check-qjson.c | 625
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 625 insertions(+)
>>>
>>> diff --git a/tests/check-qjson.c b/tests/check-qjson.c
>>> index 32ffb43..4590b3a 100644
>>> --- a/tests/check-qjson.c
>>> +++ b/tests/check-qjson.c
>>> @@ -1,8 +1,10 @@
>>> /*
>>> * Copyright IBM, Corp. 2009
>>> + * Copyright (c) 2013 Red Hat Inc.
>>> *
>>> * Authors:
>>> * Anthony Liguori <address@hidden>
>>> + * Markus Armbruster <address@hidden>,
>>> *
>>> * This work is licensed under the terms of the GNU LGPL, version 2.1 or
>>> later.
>>> * See the COPYING.LIB file in the top-level directory.
>>> @@ -131,6 +133,628 @@ static void single_quote_string(void)
>>> }
>>> }
>>>
>>> +static void utf8_string(void)
>>> +{
>>> + /*
>>> + * FIXME Current behavior for invalid UTF-8 sequences is
>>> + * incorrect. This test expects current, incorrect results.
>>> + * They're all marked "bug:" below, and are to be replaced by
>>> + * correct ones as the bugs get fixed.
>>> + *
>>> + * The JSON parser rejects some invalid sequences, but accepts
>>> + * others without correcting the problem.
>>> + *
>>> + * The JSON formatter replaces some invalid sequences by U+FFFF (a
>>> + * noncharacter), and goes wonky for others.
>>> + *
>>> + * For both directions, we should either reject all invalid
>>> + * sequences, or minimize overlong sequences and replace all other
>>> + * invalid sequences by a suitable replacement character. A
>>> + * common choice for replacement is U+FFFD.
>>> + *
>>> + * Problem: we can't easily deal with embedded U+0000. Parsing
>>> + * the JSON string "this \\u0000" is fun" yields "this \0 is fun",
>>> + * which gets misinterpreted as NUL-terminated "this ". We should
>>> + * consider using overlong encoding \xC0\x80 for U+0000 ("modified
>>> + * UTF-8").
>>> + *
>>> + * Tests are scraped from Markus Kuhn's UTF-8 decoder capability
>>> + * and stress test at
>>> + * http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>>> + */
>>> + static const struct {
>>> + const char *json_in;
>>> + const char *utf8_out;
>>> + const char *json_out; /* defaults to @json_in */
>>> + const char *utf8_in; /* defaults to @utf8_out */
>>> + } test_cases[] = {
>>> + /*
>>> + * Bug markers used here:
>>> + * - bug: not corrected
>>> + * JSON parser fails to correct invalid sequence(s)
>>> + * - bug: rejected
>>> + * JSON parser rejects invalid sequence(s)
>>> + * We may choose to define this as feature
>>> + * - bug: want "\"...\""
>>> + * JSON formatter produces incorrect result, this is the
>>> + * correct one, assuming replacement character U+FFFF
>>> + * - bug: want "..." (no \")
>>> + * JSON parser produces incorrect result, this is the
>>> + * correct one.
>>> + * Not marked explicitly, but trivial to find:
>>> + * - JSON formatter replacing invalid sequence by \\uFFFF is a
>>> + * bug if we want it to fail for invalid sequences.
>>> + */
> [...]
>>> + /* 2.1.4 4 bytes U+10000 */
>>> + {
>>> + "\"\xF0\x90\x80\x80\"",
>>> + "\xF0\x90\x80\x80",
>>> + "\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */
>>> + },
> [...]
>>> + {}
>>> + };
>>> + int i;
>>> + QObject *obj;
>>> + QString *str;
>>> + const char *json_in, *utf8_out, *utf8_in, *json_out;
>>> +
>>> + for (i = 0; test_cases[i].json_in; i++) {
>>> + json_in = test_cases[i].json_in;
>>> + utf8_out = test_cases[i].utf8_out;
>>> + utf8_in = test_cases[i].utf8_in ?: test_cases[i].utf8_out;
>>> + json_out = test_cases[i].json_out ?: test_cases[i].json_in;
>>> +
>>> + obj = qobject_from_json(json_in);
>>> + if (utf8_out) {
>>> + g_assert(obj);
>>> + g_assert(qobject_type(obj) == QTYPE_QSTRING);
>>> + str = qobject_to_qstring(obj);
>>> + g_assert_cmpstr(qstring_get_str(str), ==, utf8_out);
>>> + } else {
>>> + g_assert(!obj);
>>> + }
>>> + qobject_decref(obj);
>>> +
>>> + obj = QOBJECT(qstring_from_str(utf8_in));
>>> + str = qobject_to_json(obj);
>>> + if (json_out) {
>>> + g_assert(str);
>>> + g_assert_cmpstr(qstring_get_str(str), ==, json_out);
>>
>> This assertion trips on an ARM host (Debian stable):
>>
>> GTESTER tests/check-qjson
>> **
>> ERROR:/src/qemu/tests/check-qjson.c:775:utf8_string: assertion failed
>> (qstring_get_str(str) == json_out): ("\"\\u0400\200\"" ==
>> "\"\\u0400\\uFFFF\"")
>> GTester: last random seed: R02S88b76f755e809e9024832d2ab6660afd
>
> Must be case 2.1.4, because that's where json_out is
> "\"\\u0400\\uFFFF\"".
>
> We start by passing C string "\"\xF0\x90\x80\x80\"" to the JSON parser
> qobject_from_json(). Yields "\xF0\x90\x80\x80", as expected.
>
> We then pass that to qobject_to_json(). Should yield
> "\"\\uD800\\uDC00\"" (surrogate pair). Does yield "\"\\u0400\\uFFFF\""
> on my machine (known bug), and "\"\\u0400\200\"" on yours.
>
> Looks like the JSON formatter is not just broken (we knew that already),
> it's broken in machine-dependent ways. Good to know, thanks for
> reporting.
>
> Obvious ways to get "make check" pass for you again *now*:
>
> * Disable check-qjson. That's too big a hammer for me.
>
> * Disable test case 2.1.4 with a comment explaining why.
>
> * Suitable #ifdeffery around the expected value.
>
> Preferences?
* Fix JSON formatter :-)
Disabling 2.1.4 only reveals the next problem:
GTESTER tests/check-qjson
GTester: last random seed: R02S6754f3523201dc81bb21de42e2ba843c
**
ERROR:/src/qemu/tests/check-qjson.c:777:utf8_string: assertion failed
(qstring_get_str(str) == json_out): ("\"\\u8200\200\200\"" ==
"\"\\u8200\\uFFFF\\uFFFF\"")
diff --git a/tests/check-qjson.c b/tests/check-qjson.c
index ec85a0c..fecd9c1 100644
--- a/tests/check-qjson.c
+++ b/tests/check-qjson.c
@@ -223,12 +223,14 @@ static void utf8_string(void)
"\xE0\xA0\x80",
"\"\\u0800\"",
},
+#if 0
/* 2.1.4 4 bytes U+10000 */
{
"\"\xF0\x90\x80\x80\"",
"\xF0\x90\x80\x80",
"\"\\u0400\\uFFFF\"", /* bug: want "\"\\uD800\\uDC00\"" */
},
+#endif
/* 2.1.5 5 bytes U+200000 */
{
"\"\xF8\x88\x80\x80\x80\"",
>
>>> + } else {
>>> + g_assert(!str);
>>> + }
>>> + QDECREF(str);
>>> + qobject_decref(obj);
>>> +
>>> + /*
>>> + * disabled, because json_out currently contains the crap
>>> + * qobject_to_json() produces
>>> + */
>>> + if (0 && json_out != json_in) {
>>> + obj = qobject_from_json(json_out);
>>> + g_assert(obj);
>>> + g_assert(qobject_type(obj) == QTYPE_QSTRING);
>>> + str = qobject_to_qstring(obj);
>>> + g_assert_cmpstr(qstring_get_str(str), ==, utf8_out);
>>> + }
>>> + }
>>> +}
>>> +
>>> static void vararg_string(void)
>>> {
>>> int i;
> [...]