help-bison

Re: Parsing a language with optional spaces


From: Akim Demaille
Subject: Re: Parsing a language with optional spaces
Date: Sun, 26 Jul 2020 10:56:32 +0200

Hi Christian,

> On 20 Jul 2020, at 14:51, Christian Schoenebeck <schoenebeck@crudebyte.com> wrote:
> 
> On Saturday, 18 July 2020 08:47:07 CEST Akim Demaille wrote:
>>> On 14 Jul 2020, at 13:31, Christian Schoenebeck
>>> <schoenebeck@crudebyte.com> wrote:
>>> 
>>> On Monday, 13 July 2020 07:56:57 CEST Akim Demaille wrote:
>>> 
>>> For the 'married' parser, in the form imagined by me, there would be no
>>> lookahead token, as a lookahead token implies a context-unaware scanner.
>>> 
>>> Instead, when the parser would get to a state where it had to decide
>>> between reducing or shifting (depending on a next token), it would not
>>> decide and fork the parser state instead, in order to allow
>>> context-specific tokenization.
>> That's indeed a possible choice in the implementation of GLR parser
>> generators: to sit GLR on top of LR(0).  I've never seen performance
>> reports that compare both approaches though.
>> 
>> However, most implementations use SLR(1), because it is super simple to
>> implement (especially compared to LALR(1)), yet makes a significant
>> difference.
> 
> Forking at this point would merely be an implementation trick to buy the 
> powerful feature of context-aware scanning. Those forks would be very short-
> lived though as they would usually already be collapsed in the next step.

That's not the point.  LR(0) is *really* limited.  It cannot even grok
binary operators.  So you would spend a significant amount of your
time getting things wrong.

Not to mention that you would have to implement support for %prec and the
associativity directives (%left, %right, %nonassoc) in the parser itself.
Currently there is nothing in Bison to do that.  If you want something like
"%left '+'" to work in LR(0), then you will have to design new tables and new
code in the generated parser to deal with that.
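
To make this concrete, here is a minimal sketch (nothing real, just the
textbook fragment): even this tiny grammar is beyond LR(0).  With
"exp '+' exp" on the stack, the parser must peek at the next token to choose
between reducing and shifting, and that lookahead-based decision is exactly
what %left and %prec resolve when the SLR(1)/LALR(1) tables are built.

%token NUM "number"

%left '+'
%left '*'       /* '*' binds tighter than '+': it is declared later */

%%

exp
: exp '+' exp   /* with "exp + exp" on the stack, LR(0) cannot choose  */
| exp '*' exp   /* between reducing now and shifting a coming '+'/'*'; */
| NUM           /* that one-token decision is what %left/%prec resolve */
;               /* at table-construction time.                         */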

So you'd have many, many small forks, which would probably stack, so you would
seldom benefit from the optimized deterministic loop of glr.c.  That could
result in dramatic memory penalties, with several concurrent stacks.

SLR(1) is really simple to implement, with an automaton of the size of
LR(0).  So I don't see any value in sticking to LR(0).

> So the theoretical overhead of this would probably be like O(c n^3) instead
> of O(n^3), and assuming that c would be constant, that overhead would
> probably be negligible.

This is where we disagree :)  Sure, "negligible" is subjective, but then,
as the maintainer, I have to have stricter requirements.  I have to aim
at the strictest requirements of Bison users, not the weakest, or something
in the middle.

You know, the PostgreSQL people consider their Bison-generated parser slow,
and they are looking for ways to speed it up.  I'm sure they are not the
only ones.
https://www.postgresql.org/message-id/CACPNZCvzAjiK-CP-qcwmqTrfwM039AGDw2Bq3_aboRot0m1NGA@mail.gmail.com



> What I am wondering, independent of implementation aspects: would there be an
> automated way (from a theoretical POV) to identify the minimum required value
> of k (i.e. the number of lookahead tokens) for _individual_ parts of the
> language, so as to allow a variable k within one language, in order to find a
> good compromise between runtime efficiency and flexibility, and yet be easy
> to use?

In my experience, this point is of little practical value.  It's interesting
from the theoretical point of view, but of little to no value in practice,
for at least two reasons.

The first one is that even when this k exists, the implementation is a
nightmare.  With a global k, full LR(k) parsing is simple to build, as a fairly
straightforward generalization of LR(1), but you very quickly get huge
automata, so you have to stick to a small k.  As for a local `k`, well, I don't
know of any successful algorithm to do that for LR (it's much easier in LL),
and I'm certainly not willing to maintain such a beast: not just generating
the automaton, but also adjusting the skeletons to run it.

Note that any LR(k) grammar can be turned into an LR(1) grammar.  In
practice, people do that by hand (Adrian already explained his trick, and
I described how Bison deals with its LR(2) grammar to parse it anyway).  
In that sense, people are already doing adaptive LR(k) parsing.
When they can.
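
The trick, roughly (a simplified sketch of Bison's own case, with illustrative
token names): the ';' closing a rule is optional, so a bare identifier does not
tell you whether it extends the current right-hand side or starts the next
rule; you need the token after it.  Letting the scanner glue the ':' onto the
identifier folds that second token of lookahead into the first:

/* Sketch only.  The scanner (not shown) returns ID_COLON when an
   identifier is immediately followed by ':', and plain ID otherwise,
   which makes this LR(1) although the "natural" grammar is LR(2). */
%token ID ID_COLON

%%

rules
: rule
| rules rule
;

rule
: ID_COLON rhs        /* "sym:" starts a rule; the closing ';' is optional */
| ID_COLON rhs ';'
;

rhs
: %empty
| rhs ID
;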

And the second reason is that a fixed k is not appropriate in many situations.
Consider the case of the Tiger programming language, in which you instantiate
array types this way:

array_of_int[42] of 51

where 42 is the size of the array, and 51 the initial value of these 42
values.

In Tiger you also have the usual syntax to read inside an array:

my_array[42]

The "natural" rules are

exp
: typeid "[" exp "]" "of" exp
| lvalue

lvalue
: "id"
| lvalue "[" exp "]"

typeid
:  "id"


You can easily see that this is not LR(1): you need to read the "of"
to know whether you're in case 1 or case 2, and LR(1) only sees the "[".
You might think it's LR(4) then, but no, it is not, because of course
the size of the array/the index are *expressions*, of unbounded length:

array_of_int[0 + 1 + 2 + .. + 65535] of 0
my_array[0 + 1 + 2 + .. + 65535]

C++ is also well known for having such a need for unbounded lookahead.
The section about GLR in Bison's manual has a few other examples.
https://www.gnu.org/software/bison/manual/bison.html#GLR-Parsers
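
For the record, "going GLR" on the rules above is just this (a sketch):

%glr-parser
%expect-rr 1    /* the typeid/lvalue choice on "id"; adjust to the reported count */

%token ID "id"  OF "of"  LBRACK "["  RBRACK "]"

%%

exp
: typeid "[" exp "]" "of" exp
| lvalue
;

lvalue
: "id"
| lvalue "[" exp "]"
;

typeid
: "id"
;

The stack splits on the "[", and the split resolves as soon as the matching
"]" is, or is not, followed by "of"; since the grammar is unambiguous, exactly
one parse survives.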

So, as far as I'm concerned, anything between LR(1) and GLR (or something
else that supports ambiguities and unbounded lookahead) is a waste
of energy.


> I did not mean the core automaton of Bison.  I simply had the impression that
> when it comes to new features, the large number of existing Bison use cases
> and all their differences in individual Bison configurations ("explosions")
> add such maintenance complexity that it seems to significantly constrain (or
> at least vastly slow down) the potential development of new Bison features.

You are very right that I sometimes get mad when I have to write solid
test cases because of all these combinations.  But the automaton or the
parsing algorithm is hardly the problem here.  It is the combination of
"superficial features" (such as %parse-param, name prefixes, different
error reporting approaches, etc.) that makes it tricky, because they
interact with one another.



>> Many many people *want* a *deterministic* and *fast* parser.
>> 
>> They want determinism because with GLR (with conflicts) you just don't know
>> if some day your parser will fail because of an unforeseen ambiguity.  I
> 
> With 'fail' in regard to GLR, what do you mean exactly: memory exhaustion, or
> rather the sequence of user actions being executed in the wrong order, or
> something else?

I mean "fail" in the sense of ending up with one source with multiple parses.
When Bison, in LR mode, tells you your parser has no conflict, then you
know you are safe: there is just one parse for any accepted input.  (Of
course the grammar can be wrong, but that's a different issue.)

When you relax the constraint of being conflict-free, then you are exposed
to discovering, some day, a syntactic construct that can be read in two
different ways.

So, whenever you go GLR, of course you know you have conflicts (otherwise
you'd stick to deterministic parsing), but you are now taking chances, unless
you know your conflicts very well.

Having a deterministic parser is a guarantee that your grammar is sane,
and that you will be fast.
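
In Bison terms, here is a minimal sketch of what "taking chances" looks like.
Bison still reports the conflict at generation time, but the GLR parser
accepts the grammar anyway, and the ambiguity only bites at run time, on the
first input that triggers it, which the parser then reports as ambiguous
unless %dprec or %merge arbitrate.

%glr-parser

%token NUM "number"

%%

/* "1 - 2 - 3" has two parses.  A conflict-free deterministic grammar
   would have forced us to decide (e.g. with %left '-'); here the first
   ambiguous input is rejected as ambiguous at run time, unless
   %dprec/%merge pick a winner. */
exp
: exp '-' exp
| NUM
;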


> But the reality is also a 'bit' different, as many languages are not truly
> LALR(1) with context-free tokens to the full extent.  So what happens is that
> people usually make hacks to fit their language into that concept (e.g.
> pushing/popping states on the scanner side, bloating tokenization rules,
> hand-crafted procedural tokenization code e.g. in C, etc.).  And that's where
> determinism often proved to get broken, and it is often hard to fix
> (manually) and maintain as well, as there is always a danger of other edge
> cases still being present.

You are very right.

But some of these tricks _can_ be addressed while still being deterministic.
And that's exactly what Joel's dissertation demonstrated.  There _is_
value in "context-sensitive lexical analysis" in LR parsing (I dislike
using "context-sensitive" here, because that's not what context-sensitive
grammars are about, we are still in the realm of context-free grammars,
but it conveys the right idea of the expressive power the scanner gains).

So far this idea has mostly been implemented by scannerless parsers.  PSLR(1)
is precisely not scannerless.


> I still prefer a machine handling complexity over relying on humans to
> understand and handle a large number of possibilities reliably.  Because the
> larger the number of possibilities, the higher the chance a human loses that
> battle against a machine.

Agreed.  But with LR(1) Knuth made us win a significant battle in this war,
and I don't want to lose what we gained.


> To sort out a misapprehension: I was not suggesting ditching LALR(1) and/or
> Bison 3.x for good.  Not at all.  They are still the most important parsers
> for good reasons and will continue to be so.

Ok :)  That was indeed not so clear to me.


> But I do think that after 40 years, it is legitimate to consider conceptual
> alternatives alongside.  Simply comparing O(n) vs. O(n^3), which was taken as
> the reason 40 years ago, should not be a reason to immediately dismiss
> questioning the status quo today.  Those are worst-case considerations
> anyway.

I disagree.

O(n) is still vastly superior to O(n^3).  As a matter of fact, too many people
are sloppy with performance today, and computers now account for a huge
amount of energy consumption on this planet.  Throwing away good optimizations
is pollution.

However, I do agree that it does make new options affordable.  What used to
be impossible because of memory size requirements and/or the amount of
computation is now possible.  So of course I agree we should benefit from
these new opportunities, but not at the expense of well-established
practices.

> For instance, the concept of deep neural networks (AI) also existed decades
> ago, and despite what individual companies claim, the theoretical concepts
> have not really improved much since then, and yet we are now seeing real-life
> applications for them everywhere.  Simply because today we do have the
> appropriate hardware to run them, despite e.g. still having the same horrible
> complexity/inefficiency of training such networks.

I fully agree.


> Bison 4 is probably still far off.  So some discussions in the meantime could
> be useful.  I am also considering writing some proof-of-concept code, toying
> around with some (core) concept ideas and seeing where it goes.  Maybe it
> just ends up in the trash, or maybe not.  We will see.

You are courageous :)


>>> Besides what we already discussed, I think it would make sense to get rid
>>> of a bunch of other things that Bison still retains:
>>> 
>>> 1. No longer multiple parser types, only one: a somewhat GLR-like parser,
>>>    as discussed, a real superset of LALR, but without a lookahead token
>>>    (so not really LALR-based under the hood).

[That's the kind of sentence that made me believe you were considering
throwing away the deterministic parsers in Bison.]


>> Besides, if people want to look for brand new trendy techniques, why would
>> they come to Bison?  There are plenty of alternatives out there, and if
> 
> For instance?

ANTLR comes to mind first, and generators for PEG have flourished everywhere
these past years.  The Wikipedia page is endless:
https://en.wikipedia.org/wiki/Comparison_of_parser_generators.

> The rest is more about things 
> 'around', like support for specific programming languages, or EBNF instead of 
> BNF, etc.

(That would be really nice...  Menhir's support for parameterized rules is
brilliant.)


>> Bison still exists today because there's already tons of software that
>> relies on its current behavior.  I will *not* break that.  *Never*.
>> That's our "raison d'être" (some native English speaker should check my use
>> of French expressions, I'm not sure I'm using it properly :-).
>> 
>> I can add *more* options, provide better alternatives, but Bison cannot
>> afford to drop features just like that.
> 
> Please do, I would appreciate it if you shared ideas!

You mean ideas for new features?  I have cited a bunch of them already,
and others are in the TODO file, but currently those that are close to my
heart include:

- support for multiple start symbols
- integration of the scanner
- scoped precedence/associativity directives
- grammar modules
- rewritten glr.cc
- elimination of chain rules
- tailored error messages
  http://gallium.inria.fr/~fpottier/publis/fpottier-reachability-cc2016.pdf
etc.


>>> 3. No longer aggressive compression of parser states (which is somewhat
>>>    already implied by [2.]), because such compressions lead to significant
>>>    information loss when it comes to grammar debugging, auto-completion
>>>    features, and other user-relevant purposes.  The motivation that led to
>>>    such compressions (very low RAM) no longer exists in something like >99%
>>>    of use cases today.
>> 
>> Come on!  We all know that today the very same constraints ("be small")
>> still apply, but indeed for completely different reasons.  Compressing used
>> to be about fitting in RAM; today it's about fitting in the cache.

[I should have mentioned that users can already discard such optimizations
when they go against their usage.  See the "Default Reductions" node in
the doc.]
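
For reference, the knobs in question look roughly like this (see the "Default
Reductions" node and its neighbors in the manual for the exact semantics):

/* Sketch: trade table compactness for a more "truthful" automaton.
   Fewer default reductions mean the parser consults the lookahead
   before reducing, which helps debugging, error reporting, and
   completion-style uses. */
%define lr.default-reduction consistent   /* or: accepting */
%define lr.type ielr                      /* or: canonical-lr */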

> Today's memory optimization strategy is usually not trying to fit everything
> into the cache, but rather doing cache-line optimizations: aligning data to
> certain memory boundaries, choosing appropriate layouts for data structures
> (e.g. appropriate dimensions for matrices), and making sure that code
> portions executed over a longer period of time only access a limited set of
> cache lines during such individual time slices.

It is true.  That's even better.

> Whether these considerations make sense for a parser generator on modern (and
> not low-end) devices is questionable though, as it probably does not affect
> the overall efficiency of applications much in practice.

I agree it should be measured.


>> Not to mention lower-end devices.
> 
> Low-end devices would probably still use LALR(1) instead, yes.  They 'could'
> benefit from this as well though: if their language is mostly LALR(1), with
> only a few limited exceptions here and there, they could benefit from reduced
> maintenance complexity (e.g. if their language is still evolving) with no
> (or little) negative runtime impact.

You're describing "GLALR(1)" (or "GIELR(1)", or even "GLR(1)"), and I agree
with this approach; I like it very much.  It's "GLR(0)" that I dislike, and
also going G uselessly.


>> I don't believe in the Unique True Omnipotent Parser Yielder (let's call it
>> UTOPY for short).  There are far too many different use cases.  
> 
> Yep, I totally agree with that.  Bison, as it is now, has and will have its
> place, but I can also see that many use cases might prefer another solution
> in the future.

Agreed.

> Because in many applications today, the runtime efficiency of the core parser
> is not the biggest factor in the application's overall runtime.  For
> instance, a modern compiler spends most of its execution time in its backend,
> or at least, in general, on tasks after an initial AST has become available.

Agreed.

> I rather got the impression that many companies today have other problems:
> acquiring personnel capable of dealing with LALR(1) parser generators.  I
> also see many developers still writing all their parsers manually, as they
> told me they find it too hard to use generators.  So these would probably be
> the ones profiting the most.

The most frequently cited good reasons to write parsers by hand are:

1. speed
  Yes, hand-written parsers are usually faster, mostly because they are
  genuine code, which compilers optimize much better than table-driven
  algorithms.

2. flexibility
  If you need some more lookahead for a while to deal with local ambiguities,
  that's often not so hard.

3. tailored error messages
  Instead of some dummy "unexpected foo, expected bar or baz", writing
  something really helpful.

PSLR and GLR aim at 2.  I don't want to lose more on 1.  3 is something
I referred to above (François Pottier's paper).

Generators, on the other hand, are great at analyzing your grammar and telling
you things about it.  With the recent addition of -Wcex, Bison made a
significant step forward in this regard.  The generated reports are great
tools too.


>> Some want to change the language dynamically, some want determinism, some
>> want something DOM-oriented, not SAX-like, some have strong CPU/time
>> constraints, some don't care, some experiment, others need to be
>> production-ready, etc., etc., etc.
>> 
>> Bison fits into a niche, and it would be an error to depart from this.
> 
> I wouldn't call it 'niche'. That would still be a big niche frankly. ;-)

You're right, let's make it a "mansion" (in French, the first meaning of
"niche" is "dog house").

Cheers!

