RFC: don't double escape tnames


From: Akim Demaille
Subject: RFC: don't double escape tnames
Date: Sun, 11 Nov 2018 09:37:03 +0100

It is a common request to improve the error messages.  There are two relatively 
low-hanging fruits we should address imho: supporting UTF-8, and 
internationalization.

About UTF-8, I think that Bison was too cautious about the user’s strings and 
went too far in transforming them.  The goal (I think) was to avoid weird 
characters in the error messages, but that’s none of our business, it’s the 
user’s.  If she wants to shoot herself in the foot (or her own users’ feet), 
there’s no reason we should prevent her from doing it.  After all, she can wreck 
other messages elsewhere.

By trying too hard to avoid issues, we broke support for encodings other than 
ASCII.  That was arguably a good thing when there were many competing encodings, 
as it helped users stick to ASCII.  But then again, if the user knows her 
program will always be in Latin-1, why make things gratuitously hard?

I believe that Bison should never have touched the user’s strings; it should 
just forward them to the compiler.  This also matters in the context of 
internationalization: otherwise we would have to deal with two different 
strings, the one as written in the grammar file and the one actually emitted 
in the parser file.
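
To make the “two strings” point concrete, here is a tiny standalone illustration 
(not Bison code, just a sketch; the values mirror the ASCII entry visible in the 
diff below):

  #include <stdio.h>

  int
  main (void)
  {
    /* One token name, three forms.  Today Bison quotes and escapes the
       name a second time when emitting yytname, and yytnamerr has to
       undo that work at run time before the name can be printed.  */
    const char *in_grammar = "end of file";      /* as written in the .y file */
    const char *in_yytname = "\"end of file\"";  /* current yytname entry */
    const char *proposed   = "end of file";      /* entry with this change */
    printf ("%s | %s | %s\n", in_grammar, in_yytname, proposed);
    return 0;
  }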

So this WIP shows what’s on my mind.  It’s available in the branch 
faithful-yytname.  In the case of Bison’s own parser, this is the resulting change:

>  static const char *const yytname[] =
>  {
> -  "\"end of file\"", "error", "$undefined", "\"string\"", "\"%token\"",
> -  "\"%nterm\"", "\"%type\"", "\"%destructor\"", "\"%printer\"",
> -  "\"%left\"", "\"%right\"", "\"%nonassoc\"", "\"%precedence\"",
> -  "\"%prec\"", "\"%dprec\"", "\"%merge\"", "\"%code\"",
> -  "\"%default-prec\"", "\"%define\"", "\"%defines\"", "\"%error-verbose\"",
> -  "\"%expect\"", "\"%expect-rr\"", "\"%<flag>\"", "\"%file-prefix\"",
> -  "\"%glr-parser\"", "\"%initial-action\"", "\"%language\"",
> -  "\"%name-prefix\"", "\"%no-default-prec\"", "\"%no-lines\"",
> -  "\"%nondeterministic-parser\"", "\"%output\"", "\"%require\"",
> -  "\"%skeleton\"", "\"%start\"", "\"%token-table\"", "\"%verbose\"",
> -  "\"%yacc\"", "\"{...}\"", "\"%?{...}\"", "\"[identifier]\"", "\"char\"",
> -  "\"epilogue\"", "\"=\"", "\"identifier\"", "\"identifier:\"", "\"%%\"",
> -  "\"|\"", "\"%{...%}\"", "\";\"", "\"<tag>\"", "\"<*>\"", "\"<>\"",
> -  "\"integer\"", "\"%param\"", "\"%union\"", "\"%empty\"", "$accept",
> -  "input", "prologue_declarations", "prologue_declaration", "address@hidden",
> -  "params", "grammar_declaration", "code_props_type", "union_name",
> -  "symbol_declaration", "address@hidden", "address@hidden", 
> "precedence_declaration",
> -  "precedence_declarator", "tag.opt", "symbols.prec", "symbol.prec",
> -  "symbols.1", "generic_symlist", "generic_symlist_item", "tag",
> -  "symbol_def", "symbol_defs.1", "grammar", "rules_or_grammar_declaration",
> -  "rules", "address@hidden", "rhses.1", "rhs", "named_ref.opt", "variable", 
> "value",
> -  "id", "id_colon", "symbol", "string_as_id", "epilogue.opt", YY_NULLPTR
> +  "end of file", "error", "$undefined", "string", "%token", "%nterm",
> +  "%type", "%destructor", "%printer", "%left", "%right", "%nonassoc",
> +  "%precedence", "%prec", "%dprec", "%merge", "%code", "%default-prec",
> +  "%define", "%defines", "%error-verbose", "%expect", "%expect-rr",
> +  "%<flag>", "%file-prefix", "%glr-parser", "%initial-action", "%language",
> +  "%name-prefix", "%no-default-prec", "%no-lines",
> +  "%nondeterministic-parser", "%output", "%require", "%skeleton", "%start",
> +  "%token-table", "%verbose", "%yacc", "{...}", "%?{...}", "[identifier]",
> +  "char", "epilogue", "=", "identifier", "identifier:", "%%", "|",
> +  "%{...%}", ";", "<tag>", "<*>", "<>", "integer", "%param", "%union",
> +  "%empty", "$accept", "input", "prologue_declarations",
> +  "prologue_declaration", "address@hidden", "params", "grammar_declaration",
> +  "code_props_type", "union_name", "symbol_declaration", "address@hidden", 
> "address@hidden",
> +  "precedence_declaration", "precedence_declarator", "tag.opt",
> +  "symbols.prec", "symbol.prec", "symbols.1", "generic_symlist",
> +  "generic_symlist_item", "tag", "symbol_def", "symbol_defs.1", "grammar",
> +  "rules_or_grammar_declaration", "rules", "address@hidden", "rhses.1", 
> "rhs",
> +  "named_ref.opt", "variable", "value", "id", "id_colon", "symbol",
> +  "string_as_id", "epilogue.opt", YY_NULLPTR
>  };


My problem is actually that we documented yytname, with some details but not 
all; see:

https://www.gnu.org/software/bison/manual/html_node/Calling-Convention.html

Here, Rici, I would very much like your input: I have browsed some of the 
StackOverflow items about yytname, and I believe we are ok, but you surely know 
better.
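
For reference, the documented use I have in mind is the %token-table idiom: the 
scanner maps a multi-character literal back to its symbol number by searching 
yytname, and as far as I remember it relies on the entries being wrapped in 
double quotes.  A sketch (my paraphrase, not the manual’s exact code), meant to 
live in the grammar’s epilogue where yytname and YYNTOKENS are visible:

  #include <string.h>

  /* Return the symbol number of a literal string token, or -1 if it is
     not known.  Assumes a parser generated with %token-table.  */
  static int
  literal_symbol (const char *token_buffer)
  {
    size_t len = strlen (token_buffer);
    for (int i = 0; i < YYNTOKENS; i++)
      if (yytname[i]
          && yytname[i][0] == '"'
          && strncmp (yytname[i] + 1, token_buffer, len) == 0
          && yytname[i][len + 1] == '"'
          && yytname[i][len + 2] == '\0')
        return i;
    return -1;
  }

With unquoted entries the two '"' tests above would have to go, which is exactly 
the kind of breakage I’d like your opinion on.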


WRT i18n, my idea is to have something like:

%token NUM _("number")
%token ID _("identifier")
%token PLUS "+"

This way, we can even point xgettext at the grammar file rather than at the 
generated parser, which is a good thing given that the user is expected to use 
bison-runtime as the domain for the messages emitted by Bison.  FTR, Bison’s own 
messages currently contain untranslated bits of English (see ‘string’ and ‘end 
of file’ in the French output below):

$ bison =(echo '"')
/tmp/zsh97d4JY:1.1-2.0: erreur: « " » manquant à la fin de fichier
 "
 ^
/tmp/zsh97d4JY:1.1-2.0: erreur: erreur de syntaxe, string inattendu
 "
 ^
$ bison =(echo '%token FOO "')
/tmp/zshljrcwc:1.12-2.0: erreur: « " » manquant à la fin de fichier
 %token FOO "
            ^
/tmp/zshljrcwc:2.1: erreur: erreur de syntaxe, end of file inattendu
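
A minimal sketch of what the runtime side could look like, assuming yytname 
keeps the untranslated names and the parser only translates them when composing 
a diagnostic (the helper name is mine, not actual skeleton code):

  #include <libintl.h>  /* dgettext */

  /* Hypothetical helper: translate a token name for error messages in the
     bison-runtime domain.  The _("...") markers in the grammar would then
     only serve to let xgettext extract the strings from the .y file.  */
  static const char *
  yytname_translate (const char *name)
  {
    return dgettext ("bison-runtime", name);
  }

With that, a French locale could get, say, « nombre » instead of the English 
‘number’, provided the bison-runtime catalog has the entry.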


Cheers!


commit 49191f15b35f217446a8722d981964e8808146b3
Author: Akim Demaille <address@hidden>
Date:   Sat Nov 10 18:02:08 2018 +0100

    WIP: don't double escape tnames
    
    * #: .

diff --git a/data/glr.c b/data/glr.c
index ed79cab6..c2762e8d 100644
--- a/data/glr.c
+++ b/data/glr.c
@@ -570,36 +570,6 @@ yystpcpy (char *yydest, const char *yysrc)
 static size_t
 yytnamerr (char *yyres, const char *yystr)
 {
-  if (*yystr == '"')
-    {
-      size_t yyn = 0;
-      char const *yyp = yystr;
-
-      for (;;)
-        switch (*++yyp)
-          {
-          case '\'':
-          case ',':
-            goto do_not_strip_quotes;
-
-          case '\\':
-            if (*++yyp != '\\')
-              goto do_not_strip_quotes;
-            /* Fall through.  */
-          default:
-            if (yyres)
-              yyres[yyn] = *yyp;
-            yyn++;
-            break;
-
-          case '"':
-            if (yyres)
-              yyres[yyn] = '\0';
-            return yyn;
-          }
-    do_not_strip_quotes: ;
-    }
-
   if (! yyres)
     return strlen (yystr);
 
diff --git a/data/lalr1.cc b/data/lalr1.cc
index 8706eda6..a026998b 100644
--- a/data/lalr1.cc
+++ b/data/lalr1.cc
@@ -520,32 +520,6 @@ m4_if(b4_prefix, [yy], [],
   std::string
   ]b4_parser_class_name[::yytnamerr_ (const char *yystr)
   {
-    if (*yystr == '"')
-      {
-        std::string yyr;
-        char const *yyp = yystr;
-
-        for (;;)
-          switch (*++yyp)
-            {
-            case '\'':
-            case ',':
-              goto do_not_strip_quotes;
-
-            case '\\':
-              if (*++yyp != '\\')
-                goto do_not_strip_quotes;
-              // Fall through.
-            default:
-              yyr += *yyp;
-              break;
-
-            case '"':
-              return yyr;
-            }
-      do_not_strip_quotes: ;
-      }
-
     return yystr;
   }
 ]])[
diff --git a/data/lalr1.java b/data/lalr1.java
index 7326d624..23c0e470 100644
--- a/data/lalr1.java
+++ b/data/lalr1.java
@@ -508,29 +508,7 @@ b4_define_state])[
      YYSTR is taken from yytname.  */
   private final String yytnamerr_ (String yystr)
   {
-    if (yystr.charAt (0) == '"')
-      {
-        StringBuffer yyr = new StringBuffer ();
-        strip_quotes: for (int i = 1; i < yystr.length (); i++)
-          switch (yystr.charAt (i))
-            {
-            case '\'':
-            case ',':
-              break strip_quotes;
-
-            case '\\':
-              if (yystr.charAt(++i) != '\\')
-                break strip_quotes;
-              /* Fall through.  */
-            default:
-              yyr.append (yystr.charAt (i));
-              break;
-
-            case '"':
-              return yyr.toString ();
-            }
-      }
-    else if (yystr.equals ("$end"))
+    if (yystr.equals ("$end"))
       return "end of input";
 
     return yystr;
diff --git a/data/yacc.c b/data/yacc.c
index 0ba48bef..bd15f533 100644
--- a/data/yacc.c
+++ b/data/yacc.c
@@ -1052,36 +1052,6 @@ yy_lac (yytype_int16 *yyesa, yytype_int16 **yyes,
 static YYSIZE_T
 yytnamerr (char *yyres, const char *yystr)
 {
-  if (*yystr == '"')
-    {
-      YYSIZE_T yyn = 0;
-      char const *yyp = yystr;
-
-      for (;;)
-        switch (*++yyp)
-          {
-          case '\'':
-          case ',':
-            goto do_not_strip_quotes;
-
-          case '\\':
-            if (*++yyp != '\\')
-              goto do_not_strip_quotes;
-            /* Fall through.  */
-          default:
-            if (yyres)
-              yyres[yyn] = *yyp;
-            yyn++;
-            break;
-
-          case '"':
-            if (yyres)
-              yyres[yyn] = '\0';
-            return yyn;
-          }
-    do_not_strip_quotes: ;
-    }
-
   if (! yyres)
     return yystrlen (yystr);
 
diff --git a/src/output.c b/src/output.c
index 96295d17..5442dd27 100644
--- a/src/output.c
+++ b/src/output.c
@@ -163,7 +163,10 @@ prepare_symbols (void)
     set_quoting_flags (qo, QA_SPLIT_TRIGRAPHS);
     for (int i = 0; i < nsyms; i++)
       {
-        char *cp = quotearg_alloc (symbols[i]->tag, -1, qo);
+        char *cp =
+          symbols[i]->tag[0] == '"'
+          ? xstrdup (symbols[i]->tag)
+          : quotearg_alloc (symbols[i]->tag, -1, qo);
         /* Width of the next token, including the two quotes, the
            comma and the space.  */
         int width = strlen (cp) + 2;
diff --git a/tests/javapush.at b/tests/javapush.at
index 6b5847f5..34717458 100644
--- a/tests/javapush.at
+++ b/tests/javapush.at
@@ -726,121 +726,121 @@ total = 256
 total = 64
 ]])
 
-AT_DATA([locations],[[Next token is token "number" (1.1: 1)
+AT_DATA([locations],[[Next token is token number (1.1: 1)
 Next token is token '+' (1.2: 1)
-Next token is token "number" (1.3: 2)
+Next token is token number (1.3: 2)
 Next token is token '*' (1.4: 2)
-Next token is token "number" (1.5: 3)
+Next token is token number (1.5: 3)
 Next token is token '=' (1.6: 3)
 Next token is token '=' (1.6: 3)
 Next token is token '=' (1.6: 3)
-Next token is token "number" (1.7: 7)
+Next token is token number (1.7: 7)
 Next token is token '\n' (2.0: 7)
 Next token is token '\n' (2.0: 7)
-Next token is token "number" (2.1: 1)
+Next token is token number (2.1: 1)
 Next token is token '+' (2.2: 1)
-Next token is token "number" (2.3: 2)
+Next token is token number (2.3: 2)
 Next token is token '*' (2.4: 2)
 Next token is token '-' (2.5: 2)
-Next token is token "number" (2.6: 3)
+Next token is token number (2.6: 3)
 Next token is token '=' (2.7: 3)
 Next token is token '=' (2.7: 3)
 Next token is token '=' (2.7: 3)
 Next token is token '=' (2.7: 3)
 Next token is token '-' (2.8: 3)
-Next token is token "number" (2.9: 5)
+Next token is token number (2.9: 5)
 Next token is token '\n' (3.0: 5)
 Next token is token '\n' (3.0: 5)
 Next token is token '\n' (3.0: 5)
 Next token is token '\n' (4.0: 5)
 Next token is token '-' (4.1: 5)
-Next token is token "number" (4.2: 1)
+Next token is token number (4.2: 1)
 Next token is token '^' (4.3: 1)
-Next token is token "number" (4.4: 2)
+Next token is token number (4.4: 2)
 Next token is token '=' (4.5: 2)
 Next token is token '=' (4.5: 2)
 Next token is token '=' (4.5: 2)
 Next token is token '-' (4.6: 2)
-Next token is token "number" (4.7: 1)
+Next token is token number (4.7: 1)
 Next token is token '\n' (5.0: 1)
 Next token is token '\n' (5.0: 1)
 Next token is token '\n' (5.0: 1)
 Next token is token '(' (5.1: 1)
 Next token is token '-' (5.2: 1)
-Next token is token "number" (5.3: 1)
+Next token is token number (5.3: 1)
 Next token is token ')' (5.4: 1)
 Next token is token ')' (5.4: 1)
 Next token is token '^' (5.5: 1)
-Next token is token "number" (5.6: 2)
+Next token is token number (5.6: 2)
 Next token is token '=' (5.7: 2)
 Next token is token '=' (5.7: 2)
-Next token is token "number" (5.8: 1)
+Next token is token number (5.8: 1)
 Next token is token '\n' (6.0: 1)
 Next token is token '\n' (6.0: 1)
 Next token is token '\n' (7.0: 1)
 Next token is token '-' (7.1: 1)
 Next token is token '-' (7.2: 1)
 Next token is token '-' (7.3: 1)
-Next token is token "number" (7.4: 1)
+Next token is token number (7.4: 1)
 Next token is token '=' (7.5: 1)
 Next token is token '=' (7.5: 1)
 Next token is token '=' (7.5: 1)
 Next token is token '=' (7.5: 1)
 Next token is token '-' (7.6: 1)
-Next token is token "number" (7.7: 1)
+Next token is token number (7.7: 1)
 Next token is token '\n' (8.0: 1)
 Next token is token '\n' (8.0: 1)
 Next token is token '\n' (8.0: 1)
 Next token is token '\n' (9.0: 1)
-Next token is token "number" (9.1: 1)
+Next token is token number (9.1: 1)
 Next token is token '-' (9.2: 1)
-Next token is token "number" (9.3: 2)
+Next token is token number (9.3: 2)
 Next token is token '-' (9.4: 2)
 Next token is token '-' (9.4: 2)
-Next token is token "number" (9.5: 3)
+Next token is token number (9.5: 3)
 Next token is token '=' (9.6: 3)
 Next token is token '=' (9.6: 3)
 Next token is token '-' (9.7: 3)
-Next token is token "number" (9.8: 4)
+Next token is token number (9.8: 4)
 Next token is token '\n' (10.0: 4)
 Next token is token '\n' (10.0: 4)
 Next token is token '\n' (10.0: 4)
-Next token is token "number" (10.1: 1)
+Next token is token number (10.1: 1)
 Next token is token '-' (10.2: 1)
 Next token is token '(' (10.3: 1)
-Next token is token "number" (10.4: 2)
+Next token is token number (10.4: 2)
 Next token is token '-' (10.5: 2)
-Next token is token "number" (10.6: 3)
+Next token is token number (10.6: 3)
 Next token is token ')' (10.7: 3)
 Next token is token ')' (10.7: 3)
 Next token is token '=' (10.8: 3)
 Next token is token '=' (10.8: 3)
-Next token is token "number" (10.9: 2)
+Next token is token number (10.9: 2)
 Next token is token '\n' (11.0: 2)
 Next token is token '\n' (11.0: 2)
 Next token is token '\n' (12.0: 2)
-Next token is token "number" (12.1: 2)
+Next token is token number (12.1: 2)
 Next token is token '^' (12.2: 2)
-Next token is token "number" (12.3: 2)
+Next token is token number (12.3: 2)
 Next token is token '^' (12.4: 2)
-Next token is token "number" (12.5: 3)
+Next token is token number (12.5: 3)
 Next token is token '=' (12.6: 3)
 Next token is token '=' (12.6: 3)
 Next token is token '=' (12.6: 3)
-Next token is token "number" (12.7: 256)
+Next token is token number (12.7: 256)
 Next token is token '\n' (13.0: 256)
 Next token is token '\n' (13.0: 256)
 Next token is token '(' (13.1: 256)
-Next token is token "number" (13.2: 2)
+Next token is token number (13.2: 2)
 Next token is token '^' (13.3: 2)
-Next token is token "number" (13.4: 2)
+Next token is token number (13.4: 2)
 Next token is token ')' (13.5: 2)
 Next token is token ')' (13.5: 2)
 Next token is token '^' (13.6: 2)
-Next token is token "number" (13.7: 3)
+Next token is token number (13.7: 3)
 Next token is token '=' (13.8: 3)
 Next token is token '=' (13.8: 3)
-Next token is token "number" (13.9: 64)
+Next token is token number (13.9: 64)
 Next token is token '\n' (14.0: 64)
 Next token is token '\n' (14.0: 64)
 ]])
diff --git a/tests/regression.at b/tests/regression.at
index f24c60cc..dc212d47 100644
--- a/tests/regression.at
+++ b/tests/regression.at
@@ -460,11 +460,12 @@ AT_DATA_GRAMMAR([input.y],
 %token 'd' D_TOKEN
 %token SPECIAL "\\\'\?\"\a\b\f\n\r\t\v\001\201\x001\x000081??!"
 %token SPECIAL "\\\'\?\"\a\b\f\n\r\t\v\001\201\x001\x000081??!"
+%token MAGIC "∃¬∩∪∀"
 %%
-exp: "a" "\\\'\?\"\a\b\f\n\r\t\v\001\201\x001\x000081??!";
+exp: "a" MAGIC;
 %%
 ]AT_YYERROR_DEFINE[
-]AT_YYLEX_DEFINE([{ SPECIAL }])[
+]AT_YYLEX_DEFINE([{ MAGIC }])[
 ]AT_MAIN_DEFINE[
 ]])
 AT_BISON_OPTION_POPDEFS
@@ -487,14 +488,9 @@ input.y:22.8-63: warning: symbol "\\'?\"\a\b\f\n\r\t\v\001\201\001\201??!" used
 ]])
 AT_COMPILE([input])
 
-# Checking the error message here guarantees that yytname, which does contain
-# C-string literals, does have the trigraph escaped correctly.  Thus, the
-# symbol name reported by the parser is exactly the same as that reported by
-# Bison itself.
-AT_DATA([experr],
-[[syntax error, unexpected "\\'?\"\a\b\f\n\r\t\v\001\201\001\201??!", expecting a
+AT_PARSER_CHECK([./input], 1, [],
+[[syntax error, unexpected ∃¬∩∪∀, expecting a
 ]])
-AT_PARSER_CHECK([./input], 1, [], [experr])
 AT_CLEANUP
 
 
@@ -769,8 +765,8 @@ static const yytype_uint8 yyrline[] =
 };
 static const char *const yytname[] =
 {
-  "$end", "error", "$undefined", "\"if\"", "\"const\"", "\"then\"",
-  "\"else\"", "$accept", "statement", "struct_stat", "if", "else", YY_NULLPTR
+  "$end", "error", "$undefined", "if", "const", "then", "else", "$accept",
+  "statement", "struct_stat", "if", "else", YY_NULLPTR
 };
 static const yytype_uint16 yytoknum[] =
 {



