[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Parsing a language with optional spaces

From: Akim Demaille
Subject: Re: Parsing a language with optional spaces
Date: Sat, 18 Jul 2020 08:47:07 +0200

> Le 14 juil. 2020 à 13:31, Christian Schoenebeck <> a 
> écrit :
> On Montag, 13. Juli 2020 07:56:57 CEST Akim Demaille wrote:
> For the 'married' parser, in the form imagined by me, there would be no 
> lookahead token, as a lookahead token implies a context-unaware scanner.
> Instead, when the parser would get to a state where it had to decide between 
> reducing or shifting (depending on a next token), it would not decide and 
> fork 
> the parser state instead, in order to allow context-specific tokenization.

That's indeed a possible choice in the implementation of GLR parser generators:
to sit GLR on top of LR(0).  I've never seen performance reports that compare
both approach though.

However, most implementations use SLR(1), because it is super simple to
implement (especially compared to LALR(1)), yet make a significant difference.

>> Yes, indeed.  That's something which Joel detailed in his PhD.  Note
>> though that PSLR targets deterministic parsing.  Bison would (first)
>> target that, not GLR.  But contributions are always welcome :)
> The longer I think about it, the more I wonder whether it wouldn't make sense 
> to make a clear and radical cut with the past when developing a 'married' 
> parser to allow unfolding its full potential.
> I mean I look at what you are working on Akim, and AFAICS the majority of 
> time 
> you have to spend on maintaining (i.e. not breaking) backward compatibility 
> for ancient use cases. Most of Bison's conceptional design and internals are 
> still based on requirements and opinions from around 1980 which simply no 
> longer exist or proofed wrong.

I don't subscribe to this opinion :)  The core of Bison is well established,
and I have very little maintenance to do on it.  As a matter of fact, if you
look at the recent history, there's hardly any work in the automaton
construction.  Most of my recent efforts are about diagnostics.  Diagnostics
about the grammar (which is also what the current work on counterexamples is
about), and in the generated parsers (to provide better error messages).

Many many people *want* a *deterministic* and *fast* parser.

They want determinism because with GLR (with conflicts) you just don't know
if some day your parser will fail because of an unforeseen ambiguity.  I
know people who build critical software with Bison, and "possible failure"
on valid input is not an option.  Bison is strong on this regard.  IELR
is arguably the most powerful form of LR(1) parsing you can find on the market
with the size of LR(0).  And you can have canonical LR(1) if you want.

GLR is slow compared to LR.  People rely on our generated LR parsers being
fast. That's a true feature of Bison's.  Note only does determinism provide
people with guarantees ("your grammar is unambiguous"), it also provides
them with efficiency.

Although I don't have figures about "blind GLR" (i.e., seated on LR(0)),
I'm pretty sure it would disastrous for anything serious.  So you'd have to
go at least to SLR(1).  But then, what would be the point of throwing away
all the good technology we have for LALR(1), IELR(1) and canonical LR(1)?

If I were to build a GLR parser from scratch today, I would certainly
aim at SLR(1), you are right, LALR(1) is a nightmare, but Robert Corbett
fought that fight some 40 years ago, and IELR(1) was certainly super tricky,
but Joel E. Denny won that battle more than 10 years ago.  Why would I
discard so great achievements?

> Besides of what we already discussed, I think it would make sense to get rid 
> of a bunch of other things that Bison still retains:
> 1. No longer multiple parser types, only one: a somewhat GLR-alike parser 
> like 
>   discussed as real superset of LALR, but without lookahead token (so not 
>   really LALR based under the hood). Theoretically there is no disadvantage 
>   by doing so, because if a developer feeds it with a LALR(1) capable
>       grammer, it would still end up having the same parsing 
>   complexity=efficiency as an old-school LALR would do, probably just with 
>       slightly higher memory footprint.
>   Advantages though: much more user friendly and powerful, and a lot less 
>   code to maintain.

I bet we are talking about making the parsers at least one order of magnitude
slower here (blind GLR).  Of course this not an option.

Besides, if people want to look for brand new trendy techniques, why would
they come to Bison?  There are plenty of alternatives out there, and if
Bison still exists today, it's because there's already tons of software
that rely on its current behavior.  I will *not* break that.  *Never*.
That's our "raison d'être" (some English native should check my use of
French expressions, I'm not I'm using it properly :-).

I can add *more* options, provide better alternatives, but Bison cannot
afford to drop features just like that.

> 2. No longer raw C tables as pregenerated parser and scanner tables, instead 
>   of such C-tables: an intermediate, dynamic in-memory model with high-level 
>   API allowing to access and modify grammar and patterns on-the-fly.
>   Advantage: it would allow self-extensible languages (e.g. an input that 
>   adds keywords, patterns, drops or adds grammar rules while being parsed).
>       And implementation specific things would finally clearly be isolated by 
>   introducing such an API -> Easier to maintain.
>       Disadvantages: Probably none, except of consuming slightly more memory 
> for 
>   meta information.

This is interesting, but that's not Bison you are talking about.

> 3. No longer aggressive compression of parser states (which is somewhat 
>   already implied by [2.]), because such compressions lead to important 
>   information loss when it comes to grammar debugging or auto completion 
>   features and other user relevant purposes. The motivation that lead to such 
>       compressions (very low RAM space) no longer exist in like >99% of use 
> cases 
>   today.

Come on!  We all know that today the very same constraints ("be small") still 
but indeed for completely different reasons.  Compressing used to be about
fitting in the RAM, today it's about fitting in the cache.  Not to mention
lower-end devices.

I don't believe in the Unique True Omnipotent Parser Yielder (let's call it
UTOPY for short).  There are far too many different use cases.  Some want
to change the language dynamically, some want determinism, some what something
DOM-oriented, not SAX-like, some have strong CPU/time constraints, some don't
care, some experiment others need to be production ready, etc., etc., etc.

Bison fits into a niche, and it would be an error to depart from this.

> So these concepts are probably too radical for trying to still fit them into 
> Bison's current (3.x) design.

Don't expect that in Bison 4 either.

>> Parsers such as SGLR have stacks which are graphs, while Bison's is
>> always a tree.  All GLR parsers manage to share the past of concurrent
>> stacks, they don't have a set of separate stacks.  Bison's GLR stacks,
>> of course, share their common past, it's stored only once.  I believe
>> SGLR is also able to share the common "future", so their stacks are
>> no longer trees.  This allows SGLR to keep decent performances when parsing
>> ambiguous grammars and you want to fetch the resulting parse forest.
> Ok, I read it as Bison's current GLR parser probably not having seen much 
> attention in the past.

No, it's also about the targeted model.  Most other GLR parser generators
address a somewhat simpler issue: their parsers generate the final AST.
They are DOM-parsers.  Bison is more like SAX-parsers: we run user actions.
Memory management is really a problem here.

If you wish, Bison's GLR is tailored for "mostly deterministic grammars".
It's point is more to grant us with infinite lookahead, rather than a
means to cover ambiguous grammars. 

At least that's the mental picture I have to Paul Hilfinger's work on
GLR.  And AFAICT, there were never any complains about that.

> BTW, from my perspective a set of parser stacks (with shared past) is IMO 
> still a tree, however a tree with 3 dimensions instead of 2. A 'graph' for me 
> is something much more generic, as a graph allows much more complex 
> structures 
> (with less restrictions) like traversing to any other state, which is not 
> possible in a tree.

We agree here.  You seem to be answering to something I did not mean, so
I venture I was unclear.

Bison's GLR stack is a tree.  SGLR's is a DAG.  Bison's GLR stacks
only have forks, SGLR's also have joins.


reply via email to

[Prev in Thread] Current Thread [Next in Thread]