Re: [PATCH 00/17] RFC: multiple start symbols

bison-patches

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 00/17] RFC: multiple start symbols

From:	Adrian Vogelsgesang
Subject:	Re: [PATCH 00/17] RFC: multiple start symbols
Date:	Wed, 23 Sep 2020 11:15:03 +0000
User-agent:	Microsoft-MacOutlook/16.40.20081000

Hi Akim, hi Paul,

I second Paul’s reaction: To me this also would be a very valuable addition to 
bison.

> * Why have multiple start symbols?

We at Tableau/Hyper have the exact same use case as described In the RACK 
manual:
We need to be able to parse both SQL statements (SELECT, INSERT, …) and 
standalone SQL expressions (“a IS NOT NULL”, “SUBSTR(a, b)”, …) and want to 
reuse the same grammar/parser for both.

>> * Suppose Bison didn't have multiple start symbols; how could you achieve 
>> their effect? (This is already in the Bison manual, I think.)
> Yep, 
> https://www.gnu.org/software/bison/manual/html_node/Multiple-start_002dsymbols.html#Multiple-start_002dsymbols<https://www.gnu.org/software/bison/manual/html_node/Multiple-start_002dsymbols.html#Multiple-start_002dsymbols>.
>  It still features my original spellos from 2006...

This is exactly how we currently solved our multi-start parsing issues.

> I see several advantages, of varying importance (to my eyes):
> […]
> 2. Simpler Scanner (quite significant)

yes, I also see this as a significant benefit.
We are using our own hand-rolled lexer which made it easier to modify our code, 
but still adding support for injecting an “initial token” was not 
straightforward. I can imagine this only gets harder with lexer generators.

> 3. Better Calling Conventions (large improvement imho)

Love it! :)

> It might be useful for some use cases to exhibit yyparse_yyimpl (which means 
> choosing a public name, and also rename the switching token from YY_PARSE_FOO 
> to YYPARSE_FOO,
>  since I try to keep private details under YY_*/yy_*, and public ones in 
> YY*/yy*).
>
> I don't know about that. Feedback would be most useful!

At least for our use case, we don’t need a “dynamic dispatch” on whether to 
parse a “statement” or an “expression”.
The call-site always statically knows whether it expects a statement or an 
expression.

Cheers,
Adrian


From: bison-patches <bison-patches-bounces+avogelsgesang=tableau.com@gnu.org> 
on behalf of Akim Demaille <akim.demaille@gmail.com>
Date: Wednesday, 23. September 2020 at 08:00
To: Paul Eggert <eggert@cs.ucla.edu>
Cc: "bison-patches@gnu.org" <bison-patches@gnu.org>
Subject: Re: [PATCH 00/17] RFC: multiple start symbols

Hi Paul,

Thanks for the quick response!

> Le 21 sept. 2020 à 04:55, Paul Eggert <eggert@cs.ucla.edu> a écrit :
>
> My quick reaction is that this looks like a good thing to have, but the most 
> important part is missing: the documentation patches.

Agreed. I'm answering your questions below as a rehearsal of the genuine 
documentation to write: I have well understood your point was to guide me 
through points the documentation will have to cover, but prototyping here and 
getting your feedback will be useful!

> These should cover:
>
> * Why have multiple start symbols?

Personally I first needed this for a compiler of a language that features 
implementation files, and declaration files, both syntax sharing a lot of 
common traits, and the same AST implementation. So very similar to the RACK 
example, indeed.

> (See, for example, the motivating discussion for this feature on page 6 of 
> the RACK manual 
> <https://www.rand.org/content/dam/rand/pubs/notes/2009/N3100.pdf<https://www.rand.org/content/dam/rand/pubs/notes/2009/N3100.pdf>>.)

Wow, I really did not do my homework :( Thanks for the quick state of art! I 
should have done that. The only model I had was that of SGLR, which is really 
different though, since it does not support user actions (it directly builds 
the AST).

I ran an OCR on the RACK manual, and found the documentation of the feature in 
Section 7.5 p. 67. The interface is clumsy: the caller must set a global 
variable to the index of the start symbol to specify which one is used:

>> Start symbols are indexed, starting at zero, by the order in which they 
>> appear in the declaration. Thus, in the declaration
>>
>> %start prog stmt expr
>>
>> prog has start index 0, stint has start index 1, and expr has start index 2. 
>> The user tells the parser which start symbol to use by assigning the 
>> appropriate start index to the global variable yystartindex. The default 
>> value of this variable is zero.

There are #define to make this simpler.

>> #define PROG 0
>> #define STMT 1
>> #define EXPR 2

> ** Zyacc 
> <https://www.cs.binghamton.edu/~zdu/zyacc/doc/zyacc_4.html#SEC71<https://www.cs.binghamton.edu/~zdu/zyacc/doc/zyacc_4.html#SEC71>>

I like very much how the author formats the grammars.

>> %start infix, prefix, suffix

infix
: iExp '\n' %look(0) { return $1; }
;

This is weird: it looks that yyparse's return value is "that of the grammar". 
That matches the way this example calls yyparse:

>> /* Call infix grammar if line starts with 'i'; suffix grammar if line
>> * starts with a 's'; prefix grammar if line starts with a 'p'.
>> */
>> int main()
>> {
>> int c;
>> while ((c = getchar()) != EOF) {
>> int z = yyparse(c == 'i' ? infix : (c == 's' ? suffix : prefix));
>> printf("%d\n", z);
>> }
>> }

but I could not find any place explain the calling convention of yyparse this 
way. Maybe the author is just abusing the status of yyparse to return an int, 
which works just for this case.

Anyway, as far as I can tell, both RACK and zyacc "only" implement the dispatch 
on the multiple start symbols, and nothing else (not the Point 3 below). RACK 
does this via a global, and zyacc via an argument passed to yyparse.

> * Suppose Bison didn't have multiple start symbols; how could you achieve 
> their effect? (This is already in the Bison manual, I think.)

Yep, 
https://www.gnu.org/software/bison/manual/html_node/Multiple-start_002dsymbols.html#Multiple-start_002dsymbols<https://www.gnu.org/software/bison/manual/html_node/Multiple-start_002dsymbols.html#Multiple-start_002dsymbols>.
 It still features my original spellos from 2006...



> * What is the advantage of multiple start symbols over that simulation?

I see several advantages, of varying importance (to my eyes):


1. Simpler Grammar (not very important)

The grammar is simpler. In my running example (lexcalc, see my previous 
message), with my branch the grammar is:

0 $accept: YY_PARSE_NUM "number" $end
1 | YY_PARSE_expression expression $end
2 | YY_PARSE_input input $end

by hand the grammar would be

0 $accept: start $end
1 start: YY_PARSE_NUM "number"
1 | YY_PARSE_expression expression
2 | YY_PARSE_input input

this part took some work to do, but I really wanted "genuine multiple start 
symbols", not another layer of indirection between $accept and the user's 
grammar.


2. Simpler Scanner (quite significant)

Having the scanner pass the "switching token" it painful, and costly. Painful 
because it depends on the technology you use, and in the case of lex, it is 
requires some tricks. Painful because one would like to address this from the 
parser, but it's actually from the scanner that you have to do it. And costly, 
because chances are that your implementation will have to check for every 
single token whether it's the first one, in which case you have to return the 
switching-token. In other words, you get an additional if for every single 
token. Probably not measurable, but sounds wrong anyway.

My branch costs _nothing_. Really nothing. Indeed, in yacc.c (as in master) 
when yyparse is fired, the local variable yychar is initialized to YYEMPTY, 
which forces a call to yylex to get the first token. I only replace the initial 
value from YYEMPTY to the switching token. That's all that is needed!


3. Better Calling Conventions (large improvement imho)

But what I most enjoy is that now we have much better calling conventions with 
the various yyparse. The user no longer has to play dirty tricks with 
%parse-param to extract her AST from the parser. It's amazing that yyparse's 
return value is its status, not the actual result of the parse! So I'm quite 
happy with these functions:

> // Return type when parsing one expression.
> typedef struct
> {
> int yyvalue; // <==============
> int yystatus;
> int yynerrs;
> } yyparse_expression_t;
>
> // Parse one expression.
> yyparse_expression_t yyparse_expression (void);

where the type of yyvalue depends on the type of the symbol. So if you have a 
start symbol which resolves to Declarations, and another to Statements, you'd 
get

> typedef struct
> {
> Declarations yyvalue;
> int yystatus;
> int yynerrs;
> } yyparse_declarations_t;
>
> yyparse_declarations_t yyparse_declarations (void);


and

> typedef struct
> {
> Statements yyvalue;
> int yystatus;
> int yynerrs;
> } yyparse_statements_t;
>
> yyparse_statements_t yyparse_declarations (void);

In fact, I think these signatures are so much better than what we had so far 
that we should probably also offer it for the regular case.

But I may well have fallen for the overfitting bias: this might be very well 
fitted for me, but not for the general case. That's why I would like to get 
feedback from several people :(


> * Maybe contrast to how other Yacc-like parser generators do it? (Or at 
> least, steal good ideas from them. :-) These come to mind:
>
> ** RACK (see page 27 of its manual)
> ** Zyacc 
> <https://www.cs.binghamton.edu/~zdu/zyacc/doc/zyacc_4.html#SEC71<https://www.cs.binghamton.edu/~zdu/zyacc/doc/zyacc_4.html#SEC71>>

There's one striking difference: they offer one single function which takes the 
switching token, while I have hidden all these "details" into several 
functions. Of course, these functions (yyparse_declarations, 
yyparse_declarations, ...) rely on another function, similar to that of RACK 
and zyacc, which takes a switching-token as argument.

> // Extract data from the parser.
> typedef struct
> {
> YYSTYPE yyvalue;
> int yynerrs;
> } yyparse_yyimpl_t;
>
> // Run a full parse, using YYCHAR as switching token.
> static int
> yyparse_yyimpl (int yychar, yyparse_yyimpl_t *yyimpl);


It might be useful for some use cases to exhibit yyparse_yyimpl (which means 
choosing a public name, and also rename the switching token from YY_PARSE_FOO 
to YYPARSE_FOO, since I try to keep private details under YY_*/yy_*, and public 
ones in YY*/yy*).

I don't know about that. Feedback would be most useful!

Cheers!

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH 10/17] multistart: equip yacc.c, (continued)
- [PATCH 10/17] multistart: equip yacc.c, Akim Demaille, 2020/09/20
- [PATCH 11/17] multistart: toy with it in lexcalc, Akim Demaille, 2020/09/20
- [PATCH 12/17] todo: more, Akim Demaille, 2020/09/20
- [PATCH 13/17] multistart: adjust reader checks for generated rules, Akim Demaille, 2020/09/20
- [PATCH 14/17] multistart: use b4_accept instead of action post-processing, Akim Demaille, 2020/09/20
- [PATCH 15/17] multistart: allow tokens as start symbols, Akim Demaille, 2020/09/20
- [PATCH 16/17] yacc.c: also count calls to YYERROR in yynerrs, Akim Demaille, 2020/09/20
- [PATCH 17/17] multistart: also give access to yynerrs, Akim Demaille, 2020/09/20
- Re: [PATCH 00/17] RFC: multiple start symbols, Paul Eggert, 2020/09/20
  - Re: [PATCH 00/17] RFC: multiple start symbols, Akim Demaille, 2020/09/23
    - Re: [PATCH 00/17] RFC: multiple start symbols, Adrian Vogelsgesang <=
    - Re: [PATCH 00/17] RFC: multiple start symbols, Akim Demaille, 2020/09/27
    - Re: [PATCH 00/17] RFC: multiple start symbols, Rici Lake, 2020/09/27
    - Re: multistart: returning structs, Akim Demaille, 2020/09/29
    - Re: multistart: yynerrs, Akim Demaille, 2020/09/29
    - Re: multistart: free choice of the start symbol, Akim Demaille, 2020/09/29

Prev by Date: Re: [PATCH 00/17] RFC: multiple start symbols
Next by Date: [PATCH for Dlang support] tests: d: check the extends and implements directives for the parser class
Previous by thread: Re: [PATCH 00/17] RFC: multiple start symbols
Next by thread: Re: [PATCH 00/17] RFC: multiple start symbols
Index(es):
- Date
- Thread