[Qemu-devel] [PULL 24/58] json: Leave rejecting invalid UTF-8 to parser

From:	Markus Armbruster
Subject:	[Qemu-devel] [PULL 24/58] json: Leave rejecting invalid UTF-8 to parser
Date:	Fri, 24 Aug 2018 21:31:32 +0200

Both the lexer and the parser (attempt to) validate UTF-8 in JSON
strings.

The lexer rejects bytes that can't occur in valid UTF-8: \xC0..\xC1,
\xF5..\xFF.  This rejects some, but not all, invalid UTF-8.  It also
rejects ASCII control characters \x00..\x1F, in accordance with RFC
8259 (see recent commit "json: Reject unescaped control characters").
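
To make the byte ranges concrete, here is a minimal sketch of the
per-byte check the lexer applied inside strings before this patch.
The helper name old_lexer_rejects() is made up for illustration; the
real lexer encodes these ranges in its state table, not in a function.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of json-lexer.c: true if the lexer
 * before this patch rejects byte @b inside a string.
 */
static bool old_lexer_rejects(uint8_t b)
{
    return b <= 0x1F                    /* ASCII control characters */
        || (b >= 0xC0 && b <= 0xC1)     /* can only start overlong sequences */
        || b >= 0xF5;                   /* beyond U+10FFFF, or never used in UTF-8 */
}
```

A byte-level filter like this cannot catch everything.  For example,
the surrogate encoding \xED\xA0\x80 is invalid UTF-8, yet none of its
three bytes is rejected.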

When the lexer rejects, it ends the token right after the first bad
byte.  That works well when the bad byte is a newline, but poorly when
it's something like an overlong sequence in the middle of a string.
For instance, input

    {"abc\xC0\xAFijk": 1}\n

produces the tokens

    JSON_LCURLY   {
    JSON_ERROR    "abc\xC0
    JSON_ERROR    \xAF
    JSON_KEYWORD  ijk
    JSON_ERROR    ": 1}\n

The parser then reports four errors

    Invalid JSON syntax
    Invalid JSON syntax
    JSON parse error, invalid keyword 'ijk'
    Invalid JSON syntax

before it recovers at the newline.

The commit before previous ("json: Reject invalid UTF-8 sequences")
made the parser reject invalid UTF-8 sequences.  Since then, anything
the lexer rejects, the parser would reject as well.  Thus, rejecting
in the lexer is unnecessary for correctness, and harmful for error
reporting.
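
For reference, \xC0\xAF in the example is overlong because it is a
two-byte encoding of U+002F ('/'), which must be encoded as the single
byte \x2F.  Below is a sketch of a strict decoder that catches this.
It assumes nothing about how the parser actually implements its check;
decode_utf8() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical strict decoder, for illustration only.
 * Decode one UTF-8 sequence from s[0..n-1].
 * Returns the code point, or -1 if the sequence is invalid.
 * Stores the number of bytes examined in *used.
 */
static int32_t decode_utf8(const uint8_t *s, size_t n, size_t *used)
{
    /* Smallest code point that needs a sequence of the given length */
    static const int32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    int32_t cp;

    *used = n ? 1 : 0;
    if (n == 0) {
        return -1;
    }
    if (s[0] < 0x80) {
        return s[0];                    /* ASCII */
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07;
    } else {
        return -1;                      /* stray continuation, or \xF8..\xFF */
    }

    for (i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            *used = i;
            return -1;                  /* truncated sequence */
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *used = len;

    if (cp < min_cp[len]) {
        return -1;                      /* overlong, e.g. \xC0\xAF */
    }
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        return -1;                      /* surrogate, or out of range */
    }
    return cp;
}
```

With this sketch, \xC0\xAF decodes to 0x2F, which is below the 0x80
minimum for a two-byte sequence, so the whole sequence is rejected as
one unit rather than byte by byte.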

However, we want to keep rejecting ASCII control characters in the
lexer, because that produces the behavior we want for unclosed
strings.

We also need to keep rejecting \xFF in the lexer, because we
documented that as a way to reset the JSON parser
(docs/interop/qmp-spec.txt section 2.6 QGA Synchronization), which
means we can't change how we recover from this error now.  I wish we
hadn't done that.

I think we should treat \xFE the same as \xFF.

Change the lexer to accept \xC0..\xC1 and \xF5..\xFD.  It now rejects
only \x00..\x1F and \xFE..\xFF.  Error reporting for invalid UTF-8 in
strings is much improved, except for \xFE and \xFF.  For the example
above, the lexer now produces

    JSON_LCURLY   {
    JSON_STRING   "abc\xC0\xAFijk"
    JSON_COLON    :
    JSON_INTEGER  1
    JSON_RCURLY   }

and the parser reports just

    JSON parse error, invalid UTF-8 sequence in string
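
A simplified sketch of how a double-quoted string is scanned after
this patch.  new_lexer_rejects() and scan_dq_string() are hypothetical
names; the real lexer is table-driven and validates escape sequences
in separate states, which this sketch does not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes still rejected inside strings */
static bool new_lexer_rejects(uint8_t b)
{
    return b <= 0x1F || b >= 0xFE;
}

/*
 * Scan a double-quoted string starting at s[0] == '"'.
 * Returns the token length, including the byte that ended the token:
 * the closing quote (*ok is true), or the first rejected byte (*ok is
 * false).  Returns n with *ok false if the input ends first.
 */
static size_t scan_dq_string(const uint8_t *s, size_t n, bool *ok)
{
    bool escaped = false;
    size_t i;

    *ok = false;
    for (i = 1; i < n; i++) {
        if (new_lexer_rejects(s[i])) {
            return i + 1;       /* token ends right after the bad byte */
        }
        if (escaped) {
            escaped = false;    /* simplification: any byte may follow '\\' */
        } else if (s[i] == '\\') {
            escaped = true;
        } else if (s[i] == '"') {
            *ok = true;
            return i + 1;
        }
    }
    return n;
}
```

For the example input, the scan now runs through \xC0\xAF to the
closing quote, yielding one 10-byte string token for the parser to
check.  A \xFE or \xFF in the same position would still end the token
early.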

Signed-off-by: Markus Armbruster <address@hidden>
Reviewed-by: Eric Blake <address@hidden>
Message-Id: <address@hidden>
---
 qobject/json-lexer.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index 902fe60846..93fa2737e6 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -177,8 +177,7 @@ static const uint8_t json_lexer[][256] =  {
         ['u'] = IN_DQ_UCODE0,
     },
     [IN_DQ_STRING] = {
-        [0x20 ... 0xBF] = IN_DQ_STRING,
-        [0xC2 ... 0xF4] = IN_DQ_STRING,
+        [0x20 ... 0xFD] = IN_DQ_STRING,
         ['\\'] = IN_DQ_STRING_ESCAPE,
         ['"'] = JSON_STRING,
     },
@@ -217,8 +216,7 @@ static const uint8_t json_lexer[][256] =  {
         ['u'] = IN_SQ_UCODE0,
     },
     [IN_SQ_STRING] = {
-        [0x20 ... 0xBF] = IN_SQ_STRING,
-        [0xC2 ... 0xF4] = IN_SQ_STRING,
+        [0x20 ... 0xFD] = IN_SQ_STRING,
         ['\\'] = IN_SQ_STRING_ESCAPE,
         ['\''] = JSON_STRING,
     },
-- 
2.17.1
