qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors


From: Markus Armbruster
Subject: Re: [Qemu-devel] [PATCH 4/6] json: Nicer recovery from lexical errors
Date: Tue, 28 Aug 2018 06:35:12 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Eric Blake <address@hidden> writes:

> On 08/27/2018 02:00 AM, Markus Armbruster wrote:
>> When the lexer chokes on an input character, it consumes the
>> character, emits a JSON error token, and enters its start state.  This
>> can lead to suboptimal error recovery.  For instance, input
>>
>>      0123 ,
>>
>> produces the tokens
>>
>>      JSON_ERROR    01
>>      JSON_INTEGER  23
>>      JSON_COMMA    ,
>>
>> Make the lexer skip characters after a lexical error until a
>> structural character ('[', ']', '{', '}', ':', ','), an ASCII control
>> character, or '\xFE', or '\xFF'.
>>
>> Note that we must not skip ASCII control characters, '\xFE', '\xFF',
>> because those are documented to force the JSON parser into known-good
>> state, by docs/interop/qmp-spec.txt.
>>
>> The lexer now produces
>>
>>      JSON_ERROR    01
>>      JSON_COMMA    ,
>
> So the lexer has now completely skipped the intermediate input, but
> the resulting error message need only point at the start of where
> input went wrong, and skipping to a sane point results in fewer error
> tokens to be reported.  Makes sense.

Exactly.

We could emit a recovery token to let json_message_process_token()
report what we skipped, but I don't think it's worth the trouble.

>> Update qmp-test for the nicer error recovery: QMP now report just one
>
> s/report/reports/

Will fix.

>> error for input %p instead of two.  Also drop the newline after %p; it
>> was needed to tease out the second error.
>
> That's because pre-patch, 'p' is one of the input characters that
> requires lookahead to determine if it forms a complete token (and the
> newline provides the transition needed to consume it); now post-patch,
> the 'p' is consumed as part of the junk after the error is first
> detected at the '%'.

Exactly.

> And to my earlier complaint about 0x1a resulting in JSON_ERROR then
> JSON_INTEGER then JSON_KEYWORD, that sequence is likewise now
> identified as a single JSON_ERROR at the 'x', with the rest of the
> attempted hex number (invalid in JSON) silently skipped.  Nice.

Correct.

>>
>> Signed-off-by: Markus Armbruster <address@hidden>
>> ---
>>   qobject/json-lexer.c | 43 +++++++++++++++++++++++++++++--------------
>>   tests/qmp-test.c     |  6 +-----
>>   2 files changed, 30 insertions(+), 19 deletions(-)
>>
>> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
>> index 28582e17d9..39c7ce7adc 100644
>> --- a/qobject/json-lexer.c
>> +++ b/qobject/json-lexer.c
>> @@ -101,6 +101,7 @@
>>     enum json_lexer_state {
>>       IN_ERROR = 0,               /* must really be 0, see json_lexer[] */
>> +    IN_RECOVERY,
>>       IN_DQ_STRING_ESCAPE,
>>       IN_DQ_STRING,
>>       IN_SQ_STRING_ESCAPE,
>> @@ -130,6 +131,28 @@ QEMU_BUILD_BUG_ON(IN_START_INTERP != IN_START + 1);
>>   static const uint8_t json_lexer[][256] =  {
>>       /* Relies on default initialization to IN_ERROR! */
>>   +    /* error recovery */
>> +    [IN_RECOVERY] = {
>> +        /*
>> +         * Skip characters until a structural character, an ASCII
>> +         * control character other than '\t', or impossible UTF-8
>> +         * bytes '\xFE', '\xFF'.  Structural characters and line
>> +         * endings are promising resynchronization points.  Clients
>> +         * may use the others to force the JSON parser into known-good
>> +         * state; see docs/interop/qmp-spec.txt.
>> +         */
>> +        [0 ... 0x1F] = IN_START | LOOKAHEAD,
>
> And here's where the LOOKAHEAD bit has to be separate, because you are
> now assigning semantics to the transition on '\0' that are distinct
> from either of the two roles previously enumerated as possible.

I could do

            TERMINAL(IN_START)
            [0x20 ... 0xFD] = IN_RECOVERY,
            ['\t'] = IN_RECOVERY,

but it would be awful :)

>> +        [0x20 ... 0xFD] = IN_RECOVERY,
>> +        [0xFE ... 0xFF] = IN_START | LOOKAHEAD,
>> +        ['\t'] = IN_RECOVERY,
>> +        ['['] = IN_START | LOOKAHEAD,
>> +        [']'] = IN_START | LOOKAHEAD,
>> +        ['{'] = IN_START | LOOKAHEAD,
>> +        ['}'] = IN_START | LOOKAHEAD,
>> +        [':'] = IN_START | LOOKAHEAD,
>> +        [','] = IN_START | LOOKAHEAD,
>> +    },
>
> It took me a couple of reads before I was satisfied that everything is
> initialized as desired (range assignments followed by more-specific
> re-assignment works, but isn't common), but this looks right.
>
> Reviewed-by: Eric Blake <address@hidden>

Thanks!



reply via email to

[Prev in Thread] Current Thread [Next in Thread]