Re: Which lexer do people use?

From: Adrian Vogelsgesang
Subject: Re: Which lexer do people use?
Date: Sat, 4 Jul 2020 19:30:36 +0000
Hi Daniele,

> Which other scanners do people use?
For what it’s worth, we are using a hand-rolled scanner. It seemed the 
fastest way to get started and the easiest to maintain.

Also, it allowed us to embed a few hacks directly inside the scanner. E.g., in 
a few places our grammar is not actually LR(1). This affects only very few edge 
cases, though, so we don’t want to use GLR. Hence, our scanner does a lookahead 
and, e.g., upon encountering the token “WITH”, looks at the following token. If 
the next token is “TIMESTAMP”, it produces “WITH_LA” instead of just “WITH”. 
Thereby, we get one token of lookahead from the scanner. Combined with the one 
token of lookahead provided by bison, we can now parse our LR(2) grammar.
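
To make the idea concrete, here is a minimal sketch in C of such a one-token 
lookahead inside a hand-rolled scanner. It is not our actual code: the token 
codes, the struct scanner type, and the functions scanner_init, scan_raw and 
next_token are all made up for illustration, and the tokenizer just splits on 
whitespace. In a real setup the token codes would come from the 
bison-generated header and next_token would sit behind yylex.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; a real parser would take these from the
   bison-generated header. */
enum token { TOK_EOF = 0, TOK_WITH, TOK_WITH_LA, TOK_TIMESTAMP, TOK_IDENT };

/* Scanner state: the input cursor plus a one-token lookahead buffer. */
struct scanner {
    const char *pos;
    int has_peeked;
    enum token peeked;
};

void scanner_init(struct scanner *s, const char *input) {
    assert(s != NULL && input != NULL);
    s->pos = input;
    s->has_peeked = 0;
    s->peeked = TOK_EOF;
}

/* Case-insensitive match of a word of length len against an
   upper-case keyword. */
static int word_is(const char *word, size_t len, const char *keyword) {
    if (strlen(keyword) != len) return 0;
    for (size_t i = 0; i < len; i++)
        if (toupper((unsigned char)word[i]) != keyword[i]) return 0;
    return 1;
}

/* Raw scanner (simplified assumption): splits the input on whitespace
   and classifies each word. */
static enum token scan_raw(struct scanner *s) {
    while (isspace((unsigned char)*s->pos)) s->pos++;
    if (*s->pos == '\0') return TOK_EOF;
    const char *start = s->pos;
    while (*s->pos != '\0' && !isspace((unsigned char)*s->pos)) s->pos++;
    size_t len = (size_t)(s->pos - start);
    if (word_is(start, len, "WITH")) return TOK_WITH;
    if (word_is(start, len, "TIMESTAMP")) return TOK_TIMESTAMP;
    return TOK_IDENT;
}

/* What the parser would call. Returns the next token, but rewrites
   WITH to WITH_LA when the token after it is TIMESTAMP. The peeked
   token is buffered and handed out on the following call. */
enum token next_token(struct scanner *s) {
    enum token tok;
    if (s->has_peeked) {
        tok = s->peeked;
        s->has_peeked = 0;
    } else {
        tok = scan_raw(s);
    }
    if (tok == TOK_WITH) {
        s->peeked = scan_raw(s);
        s->has_peeked = 1;
        if (s->peeked == TOK_TIMESTAMP) return TOK_WITH_LA;
    }
    return tok;
}
```

The grammar can then use WITH_LA in the rules that need the extra token of 
context and plain WITH everywhere else, so bison itself still only ever sees 
an LR(1) problem.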

Not sure if this would also have been possible with flex, but given we have a 
hand-rolled scanner it was straightforward.

You can find a similar hack in Postgres, if you look for the WITH_LA keyword. 
Postgres is using a flex scanner and then stacks a custom layer between flex 
and bison, which introduces additional maintenance overhead.


the historical pairing is using Flex with Bison. However, while Bison is
under active development and seems to be a very solid code base, there
isn't much activity on the Flex side, and the
Flex codebase and capabilities show their age.

I recently became aware of RE/flex,
which seems very promising. However, it only generates a C++ scanner,
which may be difficult (I haven't tried) to retrofit into existing C
projects to, for example, gain full Unicode (in its UTF-8 encoded form)
support.

Has anyone tried to hammer a C++ scanner peg generated by RE/flex into a
C grammar hole generated by Bison?

Which other scanners do people use?

