help-bison

Re: improving error message


From: Hans Åberg
Subject: Re: improving error message
Date: Sat, 10 Nov 2018 14:37:09 +0100

> On 10 Nov 2018, at 12:50, Akim Demaille <address@hidden> wrote:
> 
>> Le 10 nov. 2018 à 10:38, Hans Åberg <address@hidden> a écrit :
>> 
>>> Also, see if using %param does not already
>>> give you what you need to pass information from the scanner to the
>>> parser’s yyerror.
>> 
>> How would that get into the yyerror function?
> 
> In C, arguments of %parse-param are passed to yyerror.  That’s why I mentioned
> %param, not %lex-param.  And in the C++ case, these are members.

Actually, I was thinking about the token error. But as for the yyerror function, I 
use C++ and compute the string from data in the semantic value; the prototype 
is:
  void yyparser::error(const location_type& loc, const std::string& errstr)

Then I use it for both errors and warnings, the latter being something we 
discussed long ago. For errors:
  throw syntax_error(@x, str); // Suitably computed string

For warnings:
  parser::error(@y, "warning: " + str);  // Suitably computed string

Then the error function above has:
  std::string s = "error: ";
  if (errstr.substr(0, 7) == "warning")
    s.clear();

This way, the "error: " prefix is not shown in the case of a warning.

>>>>> I believe that the right approach is rather the one we have in compilers
>>>>> and in bison: caret errors.
>>>>> 
>>>>> $ cat /tmp/foo.y
>>>>> %token FOO 0xff 0xff
>>>>> %%
>>>>> exp:;
>>>>> $ LC_ALL=C bison /tmp/foo.y
>>>>> /tmp/foo.y:1.17-20: error: syntax error, unexpected integer
>>>>> %token FOO 0xff 0xff
>>>>>              ^^^^
>>>>> I would have been bothered by « unexpected 255 ».
>>>> 
>>>> Currently, that’s for those still using only ASCII.
>>> 
>>> No, it’s not, it works with UTF-8.  Bison’s count of characters is mostly
>>> correct.  I’m talking about Bison’s own location, used to parse grammars,
>>> which is improved compared to what we ship in generated parsers.
>> 
>> Ah. I thought of errors for the generated parser only. Then I only report 
>> byte count, but using character count will probably not help much for caret 
>> errors, as they vary in width. The problem is that caret errors use two 
>> lines which are hard to synchronize in Unicode. So perhaps some kind of one 
>> line markup instead might do the trick.
> 
> Two things:
> 
> One is that the semantics of Bison’s location’s column is not specified:
> it is up to the user to track characters or bytes.  As a matter of fact, Bison
> is hardly concerned by this choice; rather it’s the scanner that has to
> deal with that.
> 
> The other one is: once you have the location, you can decide how to display
> it.  In the case of Bison, I think the caret errors are fine, but you
> could decide to do something different, say use colors or delimiters, to
> be robust to varying width.

Yes, actually I thought about the token errors. But it is interesting to see 
what you say about it.

>>>> I am using Unicode characters and LC_CTYPE=UTF-8, so it will not display 
>>>> properly. In fact, I am using special code to even write out Unicode 
>>>> characters in the error strings, since Bison assumes all strings are 
>>>> ASCII, the bytes with the high bit set being translated into escape 
>>>> sequences.
>>> 
>>> Yes, I’m aware of this issue, and we have to address it.
>> 
>> From what I could see, the function that converts it to escapes is sometimes 
>> applied once and sometimes twice, relying on it being idempotent.
> 
> It’s a bit more tricky than this.  I’m looking into it, and I’d like
> to address this in 3.3.

I realized one needs to know a lot about Bison's innards to fix this. One thing 
that made me curious is why the function it uses zeroes out the high bit: it 
looks like it has something to do with the POSIX C locale, but I could not find 
anything requiring it to be set to zero in that locale.

Right now, I use a function that translates the escape sequences back to bytes.

>>> We also have to provide support for internationalization of
>>> the token names.
>> 
>> Personally, I don't have any need for that. I use strings, like
>> %token logical_not_key "¬"
>> %token logical_and_key "∧"
>> %token logical_or_key "∨"
>> and in the case there are names, they typically match what the lexer 
>> identifies.
> 
> Yes, not all the strings should be translated.  I was thinking of
> something like
> 
> %token NUM _("number")
> %token ID _("identifier")
> %token PLUS "+"
> 
> This way, we can even point xgettext to looking at the grammar file
> rather than the generated parser.

It might be good if one wants error messages in another language.




