AW: [Grammatica-users] Tokenizer problem

From:	HECKHAUSEN Ralf
Subject:	AW: [Grammatica-users] Tokenizer problem
Date:	Fri, 1 Jul 2005 16:04:22 +0200

Well, I now understand how the problem is caused. Could you give me a hint on
how to solve the following?

%header%
GRAMMARTYPE = "LL"
%tokens%
STATION_ID = <<[A-Z]{4}>>
SPACE = " "
DZ = "DZ"
RA = "RA"
SN = "SN"
%productions%
INPUT = STATION_ID SPACE PHENOMEN [PHENOMEN [PHENOMEN]];
PHENOMEN = DZ | RA | SN; // real list has 22 items

The input "ABCD DZRASN" is not parsed correctly, because "DZRA" is returned
as a STATION_ID token.
Defining
"PHENOMEN = DZ | RA | SN | STATION_ID;"
is not a solution in this case, as it would allow invalid input.

Defining STATION_ID as LETTER LETTER LETTER LETTER would fail on stations
containing one of the phenomenon codes.
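
Both failure modes can be reproduced with a small longest-match tokenizer. This is only a sketch in Python, not Grammatica's actual implementation: the `tokenize` helper, the tie-breaking rule (earlier definition wins) and the station name "ADZB" are assumptions made for illustration.

```python
import re

def tokenize(text, tokens):
    """Longest-match tokenizer; ties go to the earlier definition (assumed)."""
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, regex in tokens:
            m = re.compile(regex).match(text, pos)
            # Only a strictly longer match replaces the current best.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        out.append(best)
        pos += len(best[1])
    return out

PHENOMENA = [("DZ", r"DZ"), ("RA", r"RA"), ("SN", r"SN")]
SPACE = [("SPACE", r" ")]

# Grammar from the message: a four-letter STATION_ID token.
station_grammar = [("STATION_ID", r"[A-Z]{4}")] + SPACE + PHENOMENA
# Alternative from the message: single LETTER tokens instead.
letter_grammar = [("LETTER", r"[A-Z]")] + SPACE + PHENOMENA

# "DZRA" is swallowed as a second STATION_ID, leaving only "SN".
print(tokenize("ABCD DZRASN", station_grammar))
# The hypothetical station "ADZB" is split into LETTER, DZ, LETTER.
print(tokenize("ADZB RA", letter_grammar))
```

In the first call the production sees STATION_ID SPACE STATION_ID SN, and in the second the station no longer arrives as four LETTER tokens, so neither token stream fits the INPUT production.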

Cheers.
Ralf

________________________________

From: address@hidden on behalf of Per Cederberg
Sent: Fri 01/07/2005 14:35
To: address@hidden
Subject: Re: [Grammatica-users] Tokenizer problem

Just adding on to the previous answer:

The problem is that the Tokenizer always returns the longest
matching token. So for the input "GO" the COMMAND token will
always be returned, as the LETTER token is only one character
long.
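
The longest-match rule can be made visible by listing every token that matches at a given position and then picking the longest one. The sketch below is in Python and only models the rule as described; the helper names `candidates` and `longest` are invented for illustration and are not part of Grammatica.

```python
import re

# Token definitions from the original question.
TOKENS = [("LETTER", r"[A-Z]"), ("COMMAND", r"GO"), ("SPACE", r" ")]

def candidates(text, pos=0):
    """List every token definition that matches at pos, with its lexeme."""
    found = []
    for name, regex in TOKENS:
        m = re.compile(regex).match(text, pos)
        if m:
            found.append((name, m.group()))
    return found

def longest(text, pos=0):
    """Pick the longest candidate, as the tokenizer described above does."""
    return max(candidates(text, pos), key=lambda c: len(c[1]))

# Position 5 is the trailing "GO" in "GO SOGO".
print(candidates("GO SOGO", 5))  # both LETTER and COMMAND match here
print(longest("GO SOGO", 5))     # the two-character COMMAND wins
```

Because COMMAND is chosen at that position, the parser never sees the two LETTER tokens that the production LETTER+ expects.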

There are at least two more alternative solutions to this:

2) Create a new token for words:

   COMMAND = "GO"
   WORD = <<[A-Z]+>>

   This will always return the WORD token except for the input
   "GO" (so "GO GO" will still cause parse errors). Please note
   that the ordering of the tokens is important in that case.

3) Modify the INPUT production to handle "GO" tokens:

   INPUT = COMMAND SPACE DETAILS+ ;

   DETAILS = LETTER
           | "GO" ;
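
Both alternatives can be sketched in Python. The sketch assumes that when two tokens match the same text the one defined first wins (one reading of the remark about ordering), and that the quoted "GO" in the DETAILS production refers to the COMMAND token. The helper names `first_token` and `matches_input` are hypothetical.

```python
import re

def first_token(text, tokens):
    """Return (name, lexeme) for the token at the start of text.
    Longest match wins; ties go to the earlier definition (assumed rule)."""
    best = None
    for name, regex in tokens:
        m = re.match(regex, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# Option 2: the same two tokens in both possible orders.
command_first = [("COMMAND", r"GO"), ("WORD", r"[A-Z]+")]
word_first = [("WORD", r"[A-Z]+"), ("COMMAND", r"GO")]

print(first_token("GO", command_first))    # tie, COMMAND is defined first
print(first_token("GO", word_first))       # tie, WORD is defined first
print(first_token("SOGO", command_first))  # WORD is the longer match

def matches_input(token_names):
    """Option 3: check token names against INPUT = COMMAND SPACE DETAILS+,
    where DETAILS = LETTER | COMMAND (the "GO" token)."""
    if token_names[:2] != ["COMMAND", "SPACE"]:
        return False
    details = token_names[2:]
    return len(details) >= 1 and all(t in ("LETTER", "COMMAND") for t in details)

# Token names a longest-match tokenizer would produce for "GO SOGO".
print(matches_input(["COMMAND", "SPACE", "LETTER", "LETTER", "COMMAND"]))
```

With WORD defined first, COMMAND would never be produced under this tie rule, which is why the ordering matters for option 2. Option 3 leaves the tokenizer alone and widens the production instead, so "GO SOGO" is accepted.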

Hope this helps! (And thanks to Anant for helping out answering
questions on this list!)

/Per

On Fri, 2005-07-01 at 10:48 +0200, HECKHAUSEN Ralf wrote:
> I have a problem with the parser, or rather the tokenizer. Below is a
> simplified example.
>
> %header%
>
> GRAMMARTYPE = "LL"
>
> %tokens%
> LETTER = <<[A-Z]>>
> COMMAND = "GO"
> SPACE = " "
>
> %productions%
>
> INPUT = COMMAND SPACE LETTER+;
>
>
> The input "GO SOUTH" is parsed correctly, but with "GO SOGO" I get a
> parse error, because "GO" is not recognized as two letters anymore. It
> seems that whenever letters are defined as tokens, they may not occur
> in other contexts. But I am sure there is a way around this problem,
> can someone help me?
> 
> Cheers,
> Ralf
