help-bison

Re: improving error message (was: bison for nlp)


From: Akim Demaille
Subject: Re: improving error message (was: bison for nlp)
Date: Sat, 10 Nov 2018 12:50:16 +0100


> On 10 Nov 2018, at 10:38, Hans Åberg <address@hidden> wrote:
> 
>> Also, see if using %param does not already
>> give you what you need to pass information from the scanner to the
>> parser’s yyerror.
> 
> How would that get into the yyerror function?

In C, the arguments declared with %parse-param are also passed to yyerror.
That’s why I mentioned %param (which declares both a %lex-param and a
%parse-param), not %lex-param.  In the C++ case, they are members of the
parser class.
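
To make this concrete, here is a minimal sketch in C of the yyerror you
would write.  It assumes the grammar declares `%param {struct scan_ctx *ctx}`;
the struct and its fields are hypothetical, made up for illustration.  The
only part that comes from Bison is the calling convention: the extra
argument comes first, the message last.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical scanner state shared with the parser.  The grammar is
   assumed to declare:  %param {struct scan_ctx *ctx}  */
struct scan_ctx {
  const char *filename;  /* assumed field: name of the input */
  int line;              /* assumed field: current line, kept by the scanner */
  int errors;            /* assumed field: number of errors reported so far */
};

/* With a %parse-param in effect, the generated C parser passes the extra
   argument to yyerror before the message. */
void yyerror(struct scan_ctx *ctx, const char *msg)
{
  assert(ctx != NULL && msg != NULL);
  fprintf(stderr, "%s:%d: %s\n", ctx->filename, ctx->line, msg);
  ctx->errors++;
}
```

Whatever the scanner stored in the context before the error is then
available when the message is printed.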


>>>> I believe that the right approach is rather the one we have in compilers
>>>> and in bison: caret errors.
>>>> 
>>>> $ cat /tmp/foo.y
>>>> %token FOO 0xff 0xff
>>>> %%
>>>> exp:;
>>>> $ LC_ALL=C bison /tmp/foo.y
>>>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>>>> %token FOO 0xff 0xff
>>>>                 ^^^^
>>>> I would have been bothered by “unexpected 255”.
>>> 
>>> Currently, that’s for those still using only ASCII.
>> 
>> No, it’s not, it works with UTF-8.  Bison’s count of characters is mostly
>> correct.  I’m talking about Bison’s own location, used to parse grammars,
>> which is improved compared to what we ship in generated parsers.
> 
> Ah. I thought of errors for the generated parser only. Then I only report 
> byte count, but using character count will probably not help much for caret 
> errors, as characters vary in width. The problem is that caret errors use 
> two lines, which are hard to synchronize in Unicode. So perhaps some kind 
> of one-line markup instead might do the trick.

Two things:

One is that the semantics of Bison’s location’s column is not specified:
it is up to the user to track characters or bytes.  As a matter of fact, Bison
is hardly concerned by this choice; rather it’s the scanner that has to
deal with that.
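
As an illustration of that choice on the scanner side, here is a small
sketch in C.  The `col_span` struct is a hypothetical stand-in for the
column fields of a location, and both function names are made up.  It
assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the column fields of a location. */
struct col_span {
  int first_column;
  int last_column;
};

/* Count UTF-8 code points in text[0..len): every byte that is not a
   continuation byte (10xxxxxx) starts a new code point.
   Assumes valid UTF-8. */
size_t utf8_code_points(const char *text, size_t len)
{
  size_t n = 0;
  for (size_t i = 0; i < len; ++i)
    if (((unsigned char) text[i] & 0xC0) != 0x80)
      ++n;
  return n;
}

/* Move the span past a token the scanner just matched.
   by_bytes != 0: columns count bytes; otherwise they count code points. */
void advance_span(struct col_span *loc, const char *text, size_t len,
                  int by_bytes)
{
  assert(loc != NULL && text != NULL);
  loc->first_column = loc->last_column;
  loc->last_column += (int) (by_bytes ? len : utf8_code_points(text, len));
}
```

Note that counting code points is still not the same as counting display
width: combining and double-width characters break that equivalence.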

The other one is: once you have the location, you can decide how to display
it.  In the case of Bison, I think the caret errors are fine, but you
could decide to do something different, say use colors or delimiters, to
be robust to varying width.
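
For instance, a one-line markup could look like the following sketch in C.
The function name and the convention of 0-based, half-open byte offsets are
assumptions made for this example, not something Bison provides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy line into out (capacity cap), wrapping the bytes [first, last)
   in the open/close markers.  Offsets are 0-based byte offsets
   (hypothetical convention).  Returns 0 on success, -1 if the range is
   invalid or out is too small. */
int mark_span(char *out, size_t cap, const char *line,
              size_t first, size_t last,
              const char *open, const char *close)
{
  assert(out && line && open && close);
  size_t len = strlen(line);
  if (first > last || last > len)
    return -1;
  int n = snprintf(out, cap, "%.*s%s%.*s%s%s",
                   (int) first, line,           /* text before the span */
                   open,
                   (int) (last - first), line + first,  /* the span */
                   close,
                   line + last);                /* text after the span */
  return (n < 0 || (size_t) n >= cap) ? -1 : 0;
}
```

Since the markers are inserted at byte offsets in the line itself, nothing
has to line up with a second row, so the width of the characters no longer
matters.  Passing terminal color escapes as `open` and `close` gives the
colored variant.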



>>> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display 
>>> properly. In fact, I am using special code to even write out Unicode 
>>> characters in the error strings, since Bison assumes all strings are ASCII, 
>>> the bytes with the high bit set being translated into escape sequences.
>> 
>> Yes, I’m aware of this issue, and we have to address it.
> 
> From what I could see, the function that converts it to escapes is sometimes 
> applied once and sometimes twice, relying on it being idempotent.

It’s a bit more tricky than this.  I’m looking into it, and I’d like
to address this in 3.3.


>> We also have to provide support for internationalization of
>> the token names.
> 
> Personally, I don't have any need for that. I use strings, like
>  %token logical_not_key "¬"
>  %token logical_and_key "∧"
>  %token logical_or_key "∨"
> and in the case there are names, they typically match what the lexer 
> identifies.

Yes, not all the strings should be translated.  I was thinking of
something like

%token NUM _("number")
%token ID _("identifier")
%token PLUS "+"

This way, we can even point xgettext at the grammar file rather than at
the generated parser.

