grammatica-users

Re: [Grammatica-users] Want to Use the PArser as non-deterministic for N


From: Matti Katila
Subject: Re: [Grammatica-users] Want to Use the PArser as non-deterministic for Natural Language Processing
Date: Tue, 12 Jul 2005 17:08:06 +0300 (EEST)

Hi Andres,

On Fri, 8 Jul 2005, Andres Hohendahl wrote:
> I am working on a (personal) natural language processing project, to play
> around with syntactic, semantic and morphologic processing.

You have a big project I would say :)

> I want also to parse several “part-of-speech” segments for NL in order to
> get a correct grammar testing using EBNF and C# under .NET framework.
>
> There are lots of mutual excluding parts when defining the different
> “tokens” as words, and

I don't think there is much use in having many different "word" tokens.
Just slice the text into sentences, then into words and punctuation marks.
Then all the fun starts.
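
A minimal sketch of that slicing step, assuming plain regular expressions
are good enough for a first pass. The function names split_sentences and
split_tokens are made up for illustration and are not part of Grammatica;
the sentence rule is deliberately naive and will trip on abbreviations.

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_tokens(sentence):
    # A token is a run of word characters (apostrophes allowed inside)
    # or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

text = "A cat has a hat. Does it fit?"
for sentence in split_sentences(text):
    print(split_tokens(sentence))
```

Every word comes out as the same kind of token here; deciding what role
each word plays is left to a later step.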

> the dictionary is not able nor practical to be loaded
> as EBNF,

True, and you cannot even use Grammatica with too rich a grammar, since the
generated parser might grow past 64K, which is the maximum size of a
method's bytecode in Java (oh, and I wouldn't count on C# still working
fine =)

> also the natural grammar is heavily context or inter-token
> dependent,

Do you have a good word database?

For example, the input "A cat has a hat." would match a Subject Predicate
Object pattern.

(sorry for my limited knowledge of spoken languages and the right terms)
A,a = noun(alphabet), or adverb
cat = noun
hat = noun
has = verb

The token stream for such an input would be:
word(noun, adverb), word(noun), word(verb), word(noun, adverb) and word(noun)
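
As a rough sketch, such a stream could be built by looking each word up in
a word database and keeping every tag it allows. The LEXICON dictionary and
the tag_words function below are hypothetical, and the tags simply repeat
the ones listed in this message rather than coming from a real linguistic
resource.

```python
# Hypothetical toy lexicon; a real project would query its word database.
LEXICON = {
    "a": {"noun", "adverb"},
    "cat": {"noun"},
    "has": {"verb"},
    "hat": {"noun"},
}

def tag_words(words):
    # Each token keeps every tag the lexicon allows for it;
    # unknown words get an empty tag set.
    return [(w, LEXICON.get(w.lower(), set())) for w in words]

stream = tag_words(["A", "cat", "has", "a", "hat"])
for word, tags in stream:
    print(f"word({', '.join(sorted(tags))})  <- {word}")
```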

Well, it sounds like you would need a context-sensitive tokenizer in which
the different possibilities are tried against the token stream.
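
One way to picture that trial-and-error step, as a sketch only: enumerate
every combination of tags and keep the ones a production accepts. This is
not how Grammatica works internally. The SPO pattern is an assumed stand-in
for a real grammar production (a noun with an optional modifier in front,
then a verb, then the same again), and matching_readings is a made-up name.

```python
import re
from itertools import product

# Ambiguous token stream: each word carries every tag it might have.
STREAM = [
    ("A", ["noun", "adverb"]),
    ("cat", ["noun"]),
    ("has", ["verb"]),
    ("a", ["noun", "adverb"]),
    ("hat", ["noun"]),
]

# Hypothetical stand-in for a Subject Predicate Object production.
SPO = re.compile(r"(adverb )?noun verb (adverb )?noun")

def matching_readings(stream, production):
    # Try every combination of tags and keep those the production accepts.
    readings = []
    for tags in product(*(alternatives for _, alternatives in stream)):
        if production.fullmatch(" ".join(tags)):
            readings.append(tags)
    return readings

for reading in matching_readings(STREAM, SPO):
    print(reading)
```

Brute force like this grows exponentially with the number of ambiguous
words, so anything beyond a toy would need to prune alternatives while
parsing instead of enumerating them all up front.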

> To allow this (I guess) I must make the tokenizer somewhat context-dependent
> and tokenize several alternate ways using a recursive pattern scanning,
> allowing it to explore the combinations or word-functions that best fits a
> production.
>
> I think this can be done adding a structure-layer on top of the Token /
> Tokenizer classes, producing a callback or event to allow external classes
> and methods to operate and get the context data for this token, and finally
> there must be a trial-error or scoring to select the most appropriate token
> which fulfills the production(s).
>
> I have already successfully coded several classes which check the
> functions of a word as a set of types, using affix-reduction, dictionary
> seek and intelligent de-stemming.
>
> Any suggestion or clue?

I couldn't follow the idea in the last three paragraphs, but I wish you the
best of luck with your project. Maybe I could understand it if you provided
some examples.


   -Matti



