Re: [Grammatica-users] Easier lexer generation.


From: Matti Katila
Subject: Re: [Grammatica-users] Easier lexer generation.
Date: Tue, 12 Jul 2005 13:31:27 +0300 (EEST)

Hello Per,

I have been turning this issue of modularizing grammars over for a while and
have tried to come up with a few solution plans. I will try to explain these
different points of view now.


Importing grammars
------------------

The first case is the one you have already thought of, i.e., importing grammar items.
https://savannah.nongnu.org/bugs/?func=detailitem&item_id=8437

%import myname regexp.grammar%

Token = "<<" myname.RegExp ">>" ;

Frankly, this sounds like a natural solution to anyone familiar with
programming languages. Some negative things I can find are:
- The solution would require the grammar files to be available. If all you
  have is an arbitrary jar file containing a parser, you cannot use it.
- The grammar files would become messy - we would lose some readability, at
  least in some sense.
- And, contrary to one of my solution ideas, every token that will be used
  later on needs to be declared in the earliest tokenizer. To explain this
  further, I will give an example involving Java and Java documentation.

    If one is parsing Java source code, they might use some regexp pattern
    to tokenize Java documentation, i.e., /** foo etc. */, and, as context
    information, the following class, attribute, method or similar thing.

    Right, so we get all the documentation within one token, nice. But how are
    we going to parse that information? It might contain HTML tags etc.
    What I'm trying to say is: all those HTML tag tokens need to be declared
    in the Java grammar if we are going to import them into the javadoc
    grammar. I'm sure the Java grammar is not the right place to declare
    HTML stuff.


Token Streams
-------------

As an idea, I think token streams sound nice, see
http://www.antlr.org/doc/streams.html

In the previous example of Java documentation parsing, I think the
documentation token needs to be translated back into characters. For example:

String doc = javadocToken.getImage();
Reader reader = new StringReader(doc);

The reader can then be given to the javadoc parser as it is.
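
A minimal sketch of that glue, assuming a hypothetical JavadocParser class
generated by Grammatica from a separate javadoc grammar (the class name and
the broad "throws Exception" clause are my assumptions, not existing code):

    import java.io.StringReader;

    import net.percederberg.grammatica.parser.Node;
    import net.percederberg.grammatica.parser.Token;

    public class JavadocGlue {

        // Re-parse the image of a javadoc token with a second parser.
        // JavadocParser is an assumed name for a parser generated from a
        // javadoc grammar; "throws Exception" avoids pinning down the
        // exact Grammatica exception types here.
        public static Node parseJavadoc(Token javadocToken) throws Exception {
            String doc = javadocToken.getImage();
            StringReader reader = new StringReader(doc);
            return new JavadocParser(reader).parse();
        }
    }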

Ok, this was not really an example of token streams. Currently I couldn't
find any use for them :o) With Grammatica you use analyzers instead.


WrapperTokenizer and Placeholders
---------------------------------

Currently I'm working on a wrapper tokenizer. I'll give a code example:

   Parser lexer = new FooLexer_Parser(new FileReader(grammarFile));

   Parser fooParser = new FooParser_Parser(lexer);

You may wonder what kind of Parser constructor takes another parser as a
parameter. Well, I use wrapping:

   FooParser_Parser(Parser parser) {
       super(null, new Analyzer());
       this.tokenizer = new WrapperTokenizer(parser);
   }

Now WrapperTokenizer extends Tokenizer and implements things like Token next().
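
To make the idea concrete, here is a rough, simplified sketch of what such a
wrapper could look like. Note that the real class would have to extend (or
mimic the public API of) Grammatica's Tokenizer, since Parser expects one;
the token queue, the name-to-id table and the declareName() helper below are
my own assumptions, not existing Grammatica API:

    import java.util.HashMap;
    import java.util.LinkedList;

    import net.percederberg.grammatica.parser.Parser;
    import net.percederberg.grammatica.parser.Token;

    public class WrapperTokenizer {

        // The wrapped lexer parser that actually reads the characters.
        // How its parse tree gets flattened into the token queue below
        // (e.g. via a special Analyzer) is left open in this sketch.
        private final Parser lexer;

        // Tokens produced by the wrapped lexer, in input order.
        private final LinkedList tokens = new LinkedList();

        // Token/production name -> id table, filled from the wrapped
        // parser's generated constants.
        private final HashMap idsByName = new HashMap();

        public WrapperTokenizer(Parser lexer) {
            this.lexer = lexer;
        }

        public void addToken(Token token) {
            tokens.addLast(token);
        }

        public void declareName(String name, int id) {
            idsByName.put(name, new Integer(id));
        }

        // Equivalent of Tokenizer.next(): hand out one token at a time.
        public Token next() {
            return tokens.isEmpty() ? null : (Token) tokens.removeFirst();
        }

        // Used by the generated parser instead of its own constants.
        public int getTokenIdForName(String name) {
            Integer id = (Integer) idsByName.get(name);
            return (id != null) ? id.intValue() : -1;
        }
    }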

This all sounds very easy, but I need to be honest and remark that I am
faced with the same problem of token and production values that was solved
in the import grammar section. Let's go on with examples again:

--- foo lexer grammar ---
%tokens%
a = "a"
b = "b"
c = "c"
space = " "
EOF = ""

%productions%
S         = (space | straight | twomix)* EOF ;
straight  = a+ | b+ | c+ ;
twomix    = ( [b+] a+ b+ )+
          | ( [c+] a+ c+ )+
          | ( [b+] c+ b+ )+ ;


--- foo parser grammar ---
%header%
TOKEN_WRAPPING = "true"
// same tokens as in lexer
%productions%
S = (blim_expr | blom_expr) EOF ;
blim_expr = straight space straight space ;
blom_expr = twomix space twomix space ;

The token wrapping declaration turns on a switch in the parser generator,
which then won't create id(entifier)s for tokens or productions but will
request them from the WrapperTokenizer instead. The idea is to generate code
like this:

        pattern = new ProductionPattern(FooLexer_Constants.BLIM_EXPR,
                                        "BLIM_EXPR");
        alt = new ProductionPatternAlternative();
        alt.addToken(tokenizer.getTokenIdForName("STRAIGHT"), 1,1);
        alt.addToken(tokenizer.getTokenIdForName("SPACE"), 1, 1);
        alt.addToken(tokenizer.getTokenIdForName("STRAIGHT"), 1,1);
        alt.addToken(tokenizer.getTokenIdForName("SPACE"), 1, 1);
        pattern.addAlternative(alt);
        addPattern(pattern);
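
For that to work, the wrapper has to know the name-to-id mapping before the
generated pattern code runs. A hypothetical wiring fragment, using the
declareName() helper from the sketch above (the constant names in
FooLexer_Constants are assumptions as well):

    // Hypothetical wiring: ids are taken from the generated
    // FooLexer_Constants class; declareName() is the helper from the
    // WrapperTokenizer sketch above, not existing Grammatica API.
    WrapperTokenizer tokenizer = new WrapperTokenizer(lexer);
    tokenizer.declareName("STRAIGHT", FooLexer_Constants.STRAIGHT);
    tokenizer.declareName("SPACE", FooLexer_Constants.SPACE);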

Actually, I suggest that Grammatica should use "blim_expr" instead of
"BLIM_EXPR". We are using a programming language where case is significant.
Even if the advice is that definitions should be typed in upper case - we
lose readability again, see:

   stringliteral ::=
             [stringprefix](shortstring | longstring)

   stringprefix ::=
             "r" | "u" | "ur" | "R" | "U" | "UR" | "Ur" | "uR"

Now I need to declare some cryptic tokens like UR_1 and UR_2. Let's see
how those look in the analyzer:

     enterUR_1(...)
     enterUR_2(...)
     enterUR_3(...)
     exitUR_1(...)
     exitUR_2(...)
     exitUR_3(...)

Actually, I don't understand why I need to create token declarations for
some of these things at all, instead of having the generator create them
automatically.

Back to the example, or actually to the idea of placeholders. Now let's modify
the parser grammar a bit:

%header%
TOKEN_WRAPPING = "true"
// same tokens as in lexer
%productions%
S = (if_one | if_two) EOF ;
if_one = "if" space+ straight ":" space+ twomix ;
if_two = "if" space+ twomix ":" space+ straight ;

As you may remember, there are no "if" or ":" tokens declared in the lexer.
The easiest fix would be to add them there, but I think a better idea is to
use a placeholder.

--- foo lexer grammar ---
%tokens%
a = "a"
b = "b"
c = "c"
space = " "
EOF = ""

placeholder = "some text hardly seen in source file" %placeholder%

%productions%
// add placeholder in grammar
S         = (placeholder | space | straight | twomix)* EOF ;
straight  = a+ | b+ | c+ ;
twomix    = ( [b+] a+ b+ )+
          | ( [c+] a+ c+ )+
          | ( [b+] c+ b+ )+ ;


--- foo parser grammar ---
%header%
TOKEN_WRAPPING = "true"
// same tokens as in lexer
// and perhaps some tokens of its own too.
%productions%
S = (if_one | if_two) EOF ;
if_one = "if" space+ straight ":" space+ twomix ;
if_two = "if" space+ twomix ":" space+ straight ;

Now, before parsing, the missing tokens are added to FooLexer_Parser. For
example like this:

In Parser.java, add:

    protected ProductionPatternAlternative placeholder = null;

When generating an alternative for the placeholder, write the normal code and
add an assignment before it:

    this.placeholder = alt = new ProductionPatternAlternative(...);

Later you can add anything to that placeholder, e.g., the "if" and ":" tokens
introduced in the final grammar.

lexer.getPlaceholder().clear();

// then create a normal pattern for "if" and ":" which is added to the bar
// pattern, and thus to the lexer too.

lexer.getPlaceholder().addProduction(bar);
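
Putting the pieces together, here is a hedged sketch of how the missing "if"
and ":" patterns could be hung into the lexer's placeholder before parsing.
getPlaceholder(), clear() and addProduction(ProductionPattern) are the
hypothetical additions proposed above, not existing Grammatica API, and the
ids are arbitrary values picked for the example:

    import net.percederberg.grammatica.parser.ParserCreationException;
    import net.percederberg.grammatica.parser.ProductionPattern;
    import net.percederberg.grammatica.parser.ProductionPatternAlternative;

    public class PlaceholderGlue {

        public static void fillPlaceholder(FooLexer_Parser lexer)
                throws ParserCreationException {
            int ifColonId    = 3001;  // id for the injected production
            int ifTokenId    = 3002;  // id for the "if" token
            int colonTokenId = 3003;  // id for the ":" token

            // Build a normal pattern for the "if" and ":" tokens that
            // only the final (parser) grammar knows about.
            ProductionPattern bar = new ProductionPattern(ifColonId, "if_colon");
            ProductionPatternAlternative alt = new ProductionPatternAlternative();
            alt.addToken(ifTokenId, 1, 1);
            alt.addToken(colonTokenId, 1, 1);
            bar.addAlternative(alt);

            // Empty the placeholder and hang the new pattern into the lexer.
            lexer.getPlaceholder().clear();
            lexer.getPlaceholder().addProduction(bar);
        }
    }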

Well, there are issues of course, for example when a wrapped parser instance
is itself wrapped again, but I'm trying to glue things together now and we'll
see what comes of it.


   -Matti



On Tue, 28 Jun 2005, Per Cederberg wrote:
> What you are suggesting is actually quite similar to an old
> future improvement suggestion for Grammatica:
> https://savannah.nongnu.org/bugs/?func=detailitem&item_id=3599
> That is interesting for more purposes though, allowing the
> creation of very readable and compact grammars. Why aren't
> grammars modularized and abstracted the same way software is?
>
> I guess it would be possible to do already today, but one sure
> needs some glue handy. The Parser class requires something that
> behaves like a Tokenizer (which is unfortunately not an
> interface), so one has to implement the public API of that
> class. Also, as a parser outputs a parse tree, and not a stream
> of tokens, one would have to create a special Analyzer subclass
> that converts certain (but obviously not all) productions into
> Tokens. Not that tricky maybe, but perhaps not all that
> beautiful either.
>
> Cheers,
>
> /Per
>
> On Tue, 2005-06-28 at 12:00 +0300, Matti Katila wrote:
> > On Tue, 28 Jun 2005, Per Cederberg wrote:
> > > Well, the Grammatica lexer is essentially identical to most other
> > > tools. It provides support for defining tokens via regexps (and
> > > plain strings). If you want to use EBNF, you can just write those
> > > "tokens" as productions instead.
> >
> > This would make the division into lexer and parser much harder. For
> > example, a python lexer which marks_ indent and dedent is easy to do after
> > lexing - we may call that a post-lexer. Post-lexing is much easier to do
> > after long literals_ and comments are recognized already.
> >
> > _marks =: http://docs.python.org/ref/indentation.html
> > _literals =: http://docs.python.org/ref/strings.html
> >
> > Of course, I don't know yet since I haven't tried it out, but I could create
> > foolexer.grammar and fooparser.grammar for a grammar foo and use some sort
> > of wrapping with Grammatica so that the lexer grammar acts as the lexer,
> > wherever I want to use EBNF in the lexer grammar.
> >
> > > The general problem with grammars where tokens are specified in
> > > EBNF is efficiency and ambiguities.
> >
> > Efficiency sounds like premature optimization, which shall be considered a no-no :)
> >
> > > That is not a problem in a language specification document where
> > > readability is the highest concern.
> >
> > I think readability is a high concern in every case :) Well, I think I know
> > what you mean.
> >
> >
> >    -Matti
>
>



