RFC: don't double escape tnames
From: Akim Demaille
Subject: RFC: don't double escape tnames
Date: Sun, 11 Nov 2018 09:37:03 +0100
It is a common request to improve the error messages. There are two relatively
low-hanging fruits we should address, IMHO: supporting UTF-8 and
internationalization.
About UTF-8: I think Bison was too cautious about the user's strings and
went too far in transforming them. The goal (I think) was to avoid weird
characters in the error messages, but that is none of our business; it is the
user's. If she wants to shoot herself in the foot (or her own users'), there is
no reason we should prevent her from doing so. After all, she can wreck other
messages elsewhere anyway.
By trying too hard to avoid issues, we broke support for encodings other than
ASCII. That was arguably a good thing back when many encodings were in use, as
it helped users stick to ASCII. But then again, if the user knows her program
will always be in Latin-1, why make things gratuitously hard?
I believe Bison should never have touched the user's strings; it should just
forward them to the compiler. This also matters in the context of
internationalization: otherwise we would have to deal with two different
strings: the one as written in the grammar file, and the one actually emitted
in the parser file.
So this WIP shows what's on my mind. It's available in the branch
faithful-yytname. In the case of Bison's own grammar, this is the resulting change:
> static const char *const yytname[] =
> {
> - "\"end of file\"", "error", "$undefined", "\"string\"", "\"%token\"",
> - "\"%nterm\"", "\"%type\"", "\"%destructor\"", "\"%printer\"",
> - "\"%left\"", "\"%right\"", "\"%nonassoc\"", "\"%precedence\"",
> - "\"%prec\"", "\"%dprec\"", "\"%merge\"", "\"%code\"",
> - "\"%default-prec\"", "\"%define\"", "\"%defines\"", "\"%error-verbose\"",
> - "\"%expect\"", "\"%expect-rr\"", "\"%<flag>\"", "\"%file-prefix\"",
> - "\"%glr-parser\"", "\"%initial-action\"", "\"%language\"",
> - "\"%name-prefix\"", "\"%no-default-prec\"", "\"%no-lines\"",
> - "\"%nondeterministic-parser\"", "\"%output\"", "\"%require\"",
> - "\"%skeleton\"", "\"%start\"", "\"%token-table\"", "\"%verbose\"",
> - "\"%yacc\"", "\"{...}\"", "\"%?{...}\"", "\"[identifier]\"", "\"char\"",
> - "\"epilogue\"", "\"=\"", "\"identifier\"", "\"identifier:\"", "\"%%\"",
> - "\"|\"", "\"%{...%}\"", "\";\"", "\"<tag>\"", "\"<*>\"", "\"<>\"",
> - "\"integer\"", "\"%param\"", "\"%union\"", "\"%empty\"", "$accept",
> - "input", "prologue_declarations", "prologue_declaration", "address@hidden",
> - "params", "grammar_declaration", "code_props_type", "union_name",
> - "symbol_declaration", "address@hidden", "address@hidden", "precedence_declaration",
> - "precedence_declarator", "tag.opt", "symbols.prec", "symbol.prec",
> - "symbols.1", "generic_symlist", "generic_symlist_item", "tag",
> - "symbol_def", "symbol_defs.1", "grammar", "rules_or_grammar_declaration",
> - "rules", "address@hidden", "rhses.1", "rhs", "named_ref.opt", "variable", "value",
> - "id", "id_colon", "symbol", "string_as_id", "epilogue.opt", YY_NULLPTR
> + "end of file", "error", "$undefined", "string", "%token", "%nterm",
> + "%type", "%destructor", "%printer", "%left", "%right", "%nonassoc",
> + "%precedence", "%prec", "%dprec", "%merge", "%code", "%default-prec",
> + "%define", "%defines", "%error-verbose", "%expect", "%expect-rr",
> + "%<flag>", "%file-prefix", "%glr-parser", "%initial-action", "%language",
> + "%name-prefix", "%no-default-prec", "%no-lines",
> + "%nondeterministic-parser", "%output", "%require", "%skeleton", "%start",
> + "%token-table", "%verbose", "%yacc", "{...}", "%?{...}", "[identifier]",
> + "char", "epilogue", "=", "identifier", "identifier:", "%%", "|",
> + "%{...%}", ";", "<tag>", "<*>", "<>", "integer", "%param", "%union",
> + "%empty", "$accept", "input", "prologue_declarations",
> + "prologue_declaration", "address@hidden", "params", "grammar_declaration",
> + "code_props_type", "union_name", "symbol_declaration", "address@hidden", "address@hidden",
> + "precedence_declaration", "precedence_declarator", "tag.opt",
> + "symbols.prec", "symbol.prec", "symbols.1", "generic_symlist",
> + "generic_symlist_item", "tag", "symbol_def", "symbol_defs.1", "grammar",
> + "rules_or_grammar_declaration", "rules", "address@hidden", "rhses.1", "rhs",
> + "named_ref.opt", "variable", "value", "id", "id_colon", "symbol",
> + "string_as_id", "epilogue.opt", YY_NULLPTR
> };
My problem is actually that we documented yytname, with some details but not
all; see:
https://www.gnu.org/software/bison/manual/html_node/Calling-Convention.html
Here, Rici, I would very much like your input: I have browsed some of the
StackOverflow questions about yytname, and I believe we are OK, but you surely
know better.
WRT i18n, my idea is to have something like
%token NUM _("number")
%token ID _("identifier")
%token PLUS "+"
This way, we can even point xgettext at the grammar file rather than at the
generated parser, which is a good thing given that the user is expected to use
bison-runtime as the domain for the messages emitted by the parser. FTR, we
currently have bits of untranslated English in Bison's own messages (see
'string' and 'end of file' in the otherwise French output):
$ bison =(echo '"')
/tmp/zsh97d4JY:1.1-2.0: erreur: « " » manquant à la fin de fichier
"
^
/tmp/zsh97d4JY:1.1-2.0: erreur: erreur de syntaxe, string inattendu
"
^
$ bison =(echo '%token FOO "')
/tmp/zshljrcwc:1.12-2.0: erreur: « " » manquant à la fin de fichier
%token FOO "
^
/tmp/zshljrcwc:2.1: erreur: erreur de syntaxe, end of file inattendu
Cheers!
commit 49191f15b35f217446a8722d981964e8808146b3
Author: Akim Demaille <address@hidden>
Date: Sat Nov 10 18:02:08 2018 +0100
WIP: don't double escape tnames
* #: .
diff --git a/data/glr.c b/data/glr.c
index ed79cab6..c2762e8d 100644
--- a/data/glr.c
+++ b/data/glr.c
@@ -570,36 +570,6 @@ yystpcpy (char *yydest, const char *yysrc)
static size_t
yytnamerr (char *yyres, const char *yystr)
{
- if (*yystr == '"')
- {
- size_t yyn = 0;
- char const *yyp = yystr;
-
- for (;;)
- switch (*++yyp)
- {
- case '\'':
- case ',':
- goto do_not_strip_quotes;
-
- case '\\':
- if (*++yyp != '\\')
- goto do_not_strip_quotes;
- /* Fall through. */
- default:
- if (yyres)
- yyres[yyn] = *yyp;
- yyn++;
- break;
-
- case '"':
- if (yyres)
- yyres[yyn] = '\0';
- return yyn;
- }
- do_not_strip_quotes: ;
- }
-
if (! yyres)
return strlen (yystr);
diff --git a/data/lalr1.cc b/data/lalr1.cc
index 8706eda6..a026998b 100644
--- a/data/lalr1.cc
+++ b/data/lalr1.cc
@@ -520,32 +520,6 @@ m4_if(b4_prefix, [yy], [],
std::string
]b4_parser_class_name[::yytnamerr_ (const char *yystr)
{
- if (*yystr == '"')
- {
- std::string yyr;
- char const *yyp = yystr;
-
- for (;;)
- switch (*++yyp)
- {
- case '\'':
- case ',':
- goto do_not_strip_quotes;
-
- case '\\':
- if (*++yyp != '\\')
- goto do_not_strip_quotes;
- // Fall through.
- default:
- yyr += *yyp;
- break;
-
- case '"':
- return yyr;
- }
- do_not_strip_quotes: ;
- }
-
return yystr;
}
]])[
diff --git a/data/lalr1.java b/data/lalr1.java
index 7326d624..23c0e470 100644
--- a/data/lalr1.java
+++ b/data/lalr1.java
@@ -508,29 +508,7 @@ b4_define_state])[
YYSTR is taken from yytname. */
private final String yytnamerr_ (String yystr)
{
- if (yystr.charAt (0) == '"')
- {
- StringBuffer yyr = new StringBuffer ();
- strip_quotes: for (int i = 1; i < yystr.length (); i++)
- switch (yystr.charAt (i))
- {
- case '\'':
- case ',':
- break strip_quotes;
-
- case '\\':
- if (yystr.charAt(++i) != '\\')
- break strip_quotes;
- /* Fall through. */
- default:
- yyr.append (yystr.charAt (i));
- break;
-
- case '"':
- return yyr.toString ();
- }
- }
- else if (yystr.equals ("$end"))
+ if (yystr.equals ("$end"))
return "end of input";
return yystr;
diff --git a/data/yacc.c b/data/yacc.c
index 0ba48bef..bd15f533 100644
--- a/data/yacc.c
+++ b/data/yacc.c
@@ -1052,36 +1052,6 @@ yy_lac (yytype_int16 *yyesa, yytype_int16 **yyes,
static YYSIZE_T
yytnamerr (char *yyres, const char *yystr)
{
- if (*yystr == '"')
- {
- YYSIZE_T yyn = 0;
- char const *yyp = yystr;
-
- for (;;)
- switch (*++yyp)
- {
- case '\'':
- case ',':
- goto do_not_strip_quotes;
-
- case '\\':
- if (*++yyp != '\\')
- goto do_not_strip_quotes;
- /* Fall through. */
- default:
- if (yyres)
- yyres[yyn] = *yyp;
- yyn++;
- break;
-
- case '"':
- if (yyres)
- yyres[yyn] = '\0';
- return yyn;
- }
- do_not_strip_quotes: ;
- }
-
if (! yyres)
return yystrlen (yystr);
diff --git a/src/output.c b/src/output.c
index 96295d17..5442dd27 100644
--- a/src/output.c
+++ b/src/output.c
@@ -163,7 +163,10 @@ prepare_symbols (void)
set_quoting_flags (qo, QA_SPLIT_TRIGRAPHS);
for (int i = 0; i < nsyms; i++)
{
- char *cp = quotearg_alloc (symbols[i]->tag, -1, qo);
+ char *cp =
+ symbols[i]->tag[0] == '"'
+ ? xstrdup (symbols[i]->tag)
+ : quotearg_alloc (symbols[i]->tag, -1, qo);
/* Width of the next token, including the two quotes, the
comma and the space. */
int width = strlen (cp) + 2;
diff --git a/tests/javapush.at b/tests/javapush.at
index 6b5847f5..34717458 100644
--- a/tests/javapush.at
+++ b/tests/javapush.at
@@ -726,121 +726,121 @@ total = 256
total = 64
]])
-AT_DATA([locations],[[Next token is token "number" (1.1: 1)
+AT_DATA([locations],[[Next token is token number (1.1: 1)
Next token is token '+' (1.2: 1)
-Next token is token "number" (1.3: 2)
+Next token is token number (1.3: 2)
Next token is token '*' (1.4: 2)
-Next token is token "number" (1.5: 3)
+Next token is token number (1.5: 3)
Next token is token '=' (1.6: 3)
Next token is token '=' (1.6: 3)
Next token is token '=' (1.6: 3)
-Next token is token "number" (1.7: 7)
+Next token is token number (1.7: 7)
Next token is token '\n' (2.0: 7)
Next token is token '\n' (2.0: 7)
-Next token is token "number" (2.1: 1)
+Next token is token number (2.1: 1)
Next token is token '+' (2.2: 1)
-Next token is token "number" (2.3: 2)
+Next token is token number (2.3: 2)
Next token is token '*' (2.4: 2)
Next token is token '-' (2.5: 2)
-Next token is token "number" (2.6: 3)
+Next token is token number (2.6: 3)
Next token is token '=' (2.7: 3)
Next token is token '=' (2.7: 3)
Next token is token '=' (2.7: 3)
Next token is token '=' (2.7: 3)
Next token is token '-' (2.8: 3)
-Next token is token "number" (2.9: 5)
+Next token is token number (2.9: 5)
Next token is token '\n' (3.0: 5)
Next token is token '\n' (3.0: 5)
Next token is token '\n' (3.0: 5)
Next token is token '\n' (4.0: 5)
Next token is token '-' (4.1: 5)
-Next token is token "number" (4.2: 1)
+Next token is token number (4.2: 1)
Next token is token '^' (4.3: 1)
-Next token is token "number" (4.4: 2)
+Next token is token number (4.4: 2)
Next token is token '=' (4.5: 2)
Next token is token '=' (4.5: 2)
Next token is token '=' (4.5: 2)
Next token is token '-' (4.6: 2)
-Next token is token "number" (4.7: 1)
+Next token is token number (4.7: 1)
Next token is token '\n' (5.0: 1)
Next token is token '\n' (5.0: 1)
Next token is token '\n' (5.0: 1)
Next token is token '(' (5.1: 1)
Next token is token '-' (5.2: 1)
-Next token is token "number" (5.3: 1)
+Next token is token number (5.3: 1)
Next token is token ')' (5.4: 1)
Next token is token ')' (5.4: 1)
Next token is token '^' (5.5: 1)
-Next token is token "number" (5.6: 2)
+Next token is token number (5.6: 2)
Next token is token '=' (5.7: 2)
Next token is token '=' (5.7: 2)
-Next token is token "number" (5.8: 1)
+Next token is token number (5.8: 1)
Next token is token '\n' (6.0: 1)
Next token is token '\n' (6.0: 1)
Next token is token '\n' (7.0: 1)
Next token is token '-' (7.1: 1)
Next token is token '-' (7.2: 1)
Next token is token '-' (7.3: 1)
-Next token is token "number" (7.4: 1)
+Next token is token number (7.4: 1)
Next token is token '=' (7.5: 1)
Next token is token '=' (7.5: 1)
Next token is token '=' (7.5: 1)
Next token is token '=' (7.5: 1)
Next token is token '-' (7.6: 1)
-Next token is token "number" (7.7: 1)
+Next token is token number (7.7: 1)
Next token is token '\n' (8.0: 1)
Next token is token '\n' (8.0: 1)
Next token is token '\n' (8.0: 1)
Next token is token '\n' (9.0: 1)
-Next token is token "number" (9.1: 1)
+Next token is token number (9.1: 1)
Next token is token '-' (9.2: 1)
-Next token is token "number" (9.3: 2)
+Next token is token number (9.3: 2)
Next token is token '-' (9.4: 2)
Next token is token '-' (9.4: 2)
-Next token is token "number" (9.5: 3)
+Next token is token number (9.5: 3)
Next token is token '=' (9.6: 3)
Next token is token '=' (9.6: 3)
Next token is token '-' (9.7: 3)
-Next token is token "number" (9.8: 4)
+Next token is token number (9.8: 4)
Next token is token '\n' (10.0: 4)
Next token is token '\n' (10.0: 4)
Next token is token '\n' (10.0: 4)
-Next token is token "number" (10.1: 1)
+Next token is token number (10.1: 1)
Next token is token '-' (10.2: 1)
Next token is token '(' (10.3: 1)
-Next token is token "number" (10.4: 2)
+Next token is token number (10.4: 2)
Next token is token '-' (10.5: 2)
-Next token is token "number" (10.6: 3)
+Next token is token number (10.6: 3)
Next token is token ')' (10.7: 3)
Next token is token ')' (10.7: 3)
Next token is token '=' (10.8: 3)
Next token is token '=' (10.8: 3)
-Next token is token "number" (10.9: 2)
+Next token is token number (10.9: 2)
Next token is token '\n' (11.0: 2)
Next token is token '\n' (11.0: 2)
Next token is token '\n' (12.0: 2)
-Next token is token "number" (12.1: 2)
+Next token is token number (12.1: 2)
Next token is token '^' (12.2: 2)
-Next token is token "number" (12.3: 2)
+Next token is token number (12.3: 2)
Next token is token '^' (12.4: 2)
-Next token is token "number" (12.5: 3)
+Next token is token number (12.5: 3)
Next token is token '=' (12.6: 3)
Next token is token '=' (12.6: 3)
Next token is token '=' (12.6: 3)
-Next token is token "number" (12.7: 256)
+Next token is token number (12.7: 256)
Next token is token '\n' (13.0: 256)
Next token is token '\n' (13.0: 256)
Next token is token '(' (13.1: 256)
-Next token is token "number" (13.2: 2)
+Next token is token number (13.2: 2)
Next token is token '^' (13.3: 2)
-Next token is token "number" (13.4: 2)
+Next token is token number (13.4: 2)
Next token is token ')' (13.5: 2)
Next token is token ')' (13.5: 2)
Next token is token '^' (13.6: 2)
-Next token is token "number" (13.7: 3)
+Next token is token number (13.7: 3)
Next token is token '=' (13.8: 3)
Next token is token '=' (13.8: 3)
-Next token is token "number" (13.9: 64)
+Next token is token number (13.9: 64)
Next token is token '\n' (14.0: 64)
Next token is token '\n' (14.0: 64)
]])
diff --git a/tests/regression.at b/tests/regression.at
index f24c60cc..dc212d47 100644
--- a/tests/regression.at
+++ b/tests/regression.at
@@ -460,11 +460,12 @@ AT_DATA_GRAMMAR([input.y],
%token 'd' D_TOKEN
%token SPECIAL "\\\'\?\"\a\b\f\n\r\t\v\001\201\x001\x000081??!"
%token SPECIAL "\\\'\?\"\a\b\f\n\r\t\v\001\201\x001\x000081??!"
+%token MAGIC "∃¬∩∪∀"
%%
-exp: "a" "\\\'\?\"\a\b\f\n\r\t\v\001\201\x001\x000081??!";
+exp: "a" MAGIC;
%%
]AT_YYERROR_DEFINE[
-]AT_YYLEX_DEFINE([{ SPECIAL }])[
+]AT_YYLEX_DEFINE([{ MAGIC }])[
]AT_MAIN_DEFINE[
]])
AT_BISON_OPTION_POPDEFS
@@ -487,14 +488,9 @@ input.y:22.8-63: warning: symbol "\\'?\"\a\b\f\n\r\t\v\001\201\001\201??!" used
]])
AT_COMPILE([input])
-# Checking the error message here guarantees that yytname, which does contain
-# C-string literals, does have the trigraph escaped correctly. Thus, the
-# symbol name reported by the parser is exactly the same as that reported by
-# Bison itself.
-AT_DATA([experr],
-[[syntax error, unexpected "\\'?\"\a\b\f\n\r\t\v\001\201\001\201??!", expecting a
+AT_PARSER_CHECK([./input], 1, [],
+[[syntax error, unexpected ∃¬∩∪∀, expecting a
]])
-AT_PARSER_CHECK([./input], 1, [], [experr])
AT_CLEANUP
@@ -769,8 +765,8 @@ static const yytype_uint8 yyrline[] =
};
static const char *const yytname[] =
{
- "$end", "error", "$undefined", "\"if\"", "\"const\"", "\"then\"",
- "\"else\"", "$accept", "statement", "struct_stat", "if", "else", YY_NULLPTR
+ "$end", "error", "$undefined", "if", "const", "then", "else", "$accept",
+ "statement", "struct_stat", "if", "else", YY_NULLPTR
};
static const yytype_uint16 yytoknum[] =
{