Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
Date: Tue, 4 Mar 2014 21:56:25 +0530
Good to see that you are making progress.

On 3/3/14 12:58 PM, Kevin Martin wrote:
>> No it is prefixes. For example, when the word മലയാളം is learned, varnam
>> learns the prefixes, മല, മലയാ etc. So when it gets a pattern like
>> "malayali", it can easily tokenize it rather than typing like "malayaali".
> 1.What do you mean by tokenization? A token is a pattern to symbol mapping.
> So tokenization means matching the entire word to its malayalam symbol?

Tokenization is splitting the input into multiple tokens. For example:

input - malayalam
tokens - [[ma], [la], [ya], [lam]]

Each will be a `vtoken` instance with the relevant attributes set. For the
token `ma`, it will be marked as a consonant.

Tokenization happens left to right. It is a greedy tokenizer which finds the
longest possible match. Look at the `vst_tokenize` function to learn how it
works.
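
The greedy, left-to-right matching described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual `vst_tokenize` code: the `PATTERNS` table and the `tokenize` function are made up for the example, and the real scheme file holds far more patterns and attributes per token.

```python
# Toy pattern table (an assumption for illustration, not the real scheme file).
PATTERNS = {
    "ma": "മ",
    "la": "ല",
    "ya": "യാ",
    "lam": "ളം",
    "l": "ൽ",
    "a": "അ",
}


def tokenize(text, patterns=PATTERNS):
    """Split text left to right, always taking the longest matching pattern."""
    longest = max(len(p) for p in patterns)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in patterns:
                tokens.append((candidate, patterns[candidate]))
                pos += size
                break
        else:
            # No pattern matches here: pass the character through unchanged.
            tokens.append((text[pos], text[pos]))
            pos += 1
    return tokens


tokens = tokenize("malayalam")
print([pattern for pattern, _ in tokens])
print("".join(value for _, value in tokens))
```

With this toy table the input `malayalam` splits into `ma`, `la`, `ya`, `lam`. At the last position `lam` wins over the shorter `la` and `l`, which is the greedy part.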
> 2. The porter stemmer stems the given English word to a base word by
> stripping it off all the suffixes. How can we stem a malayalam word?
> Suppose that varnam is encountering the word മലയാളം for the first time. The
> input was 'malayalam'. In this case, as of now, varnam learns to map 'mala'
> to മല, 'malaya' to മലയാ and so on? Hence learning a word makes varnam learn
> the mappings for all its prefixes, right?

Something like the following:

stem(അവളുടെ) = അവൾ
stem(കാറ്റിന്റെ) = കാറ്റ്
stem(ഡോക്ക്റ്ററുടെ) = ഡോക്ക്റ്റർ

You are headed in the right direction.
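
One way to read the three `stem(...)` examples is as suffix rewrites: strip an inflectional ending and append whatever closes the base form. The sketch below assumes that reading. The `STEM_RULES` table and the `stem` function are hypothetical and cover only these three words, not Malayalam in general.

```python
# Hypothetical suffix rules: (suffix to strip, replacement to append).
STEM_RULES = [
    ("ിന്റെ", "്"),
    ("ളുടെ", "ൾ"),
    ("റുടെ", "ർ"),
]


def stem(word, rules=STEM_RULES):
    """Apply the first matching rule, trying longer suffixes first."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        # Require something to remain in front of the suffix.
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word  # no rule applies, keep the word as it is


for word in ["അവളുടെ", "കാറ്റിന്റെ", "ഡോക്ക്റ്ററുടെ"]:
    print(word, "->", stem(word))
```

Words that match no rule come back unchanged, which fits the idea of accepting partial coverage rather than chasing 100% stemming.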
> 3. Let me propose a stemmer that rips off suffixes. Consider the word
> മലയാളം (malayalam) that was learned by varnam.
> I think the goal of the stemmer should be to get the base word മലയാള
> (malayal) rather than മലയൽ. To do this, I think we will need to compare the
> obtained base word with the original word. Let us assume that the stemming
> algorithm got the base word 'malayal' from 'malayalam'. We can make sure
> that this is mapped to മലയാള rather than മലയൽ by ripping off the equivalent
> suffix from the malayalam transliteration word. That is,
> removing the suffix 'am' from 'malayalam' removes the ം from 'മലയാളം'. For
> this, 'am' needs to have been matched with ം in the scheme file. Hence
> we would get മലയാള for 'malayal' and this can be learned. This would result
> in the easier mapping of malayali to മലയാളി .
> Another example :
> thozhilalikalude is തൊഴിലാളികളുടെ
> a).sending 'thozhilalikalude' to the stemmer, we obtain 'thozhilalikal' in
> step 1. As a corresponding step ു ടെ is removed from തൊഴിലാളികളുടെ and
> results in തൊഴിലാളികള. No learning occurs in this step because we have not
> reached the base word yet.
> b) 'thozhilalikal' is stemmed to 'thozhilali' - കള is removed from
> തൊഴിലാളികള. Even though 'kal', the suffix that was removed, could be
> matched to കൽ, we do not do that because the word before stemming had ള.
> Produces തൊഴിലാളി .
> c) thozhilali is stemmed to thozhilal - Produces തൊഴിലാള from തൊഴിലാളി.
> This base word and the corresponding malayalam mapping is learned by varnam.
> I have not completed drafting the malayalam stemmer algorithm. It seems to
> have many more condition checks than I had anticipated and could end up
> being larger and more complicated than the porter stemmer. But before I
> proceed, I need to know whether the logic I presented above is correct.
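
The lockstep idea in the quoted proposal (strip a romanized suffix and the equivalent Malayalam suffix together, one step at a time) can be sketched like this. The `SUFFIX_PAIRS` table and the `stem_in_lockstep` function are assumptions for illustration only, and the pairs cover just the two words discussed.

```python
# Hypothetical paired suffixes: (romanized suffix, Malayalam suffix).
SUFFIX_PAIRS = [
    ("ude", "ുടെ"),
    ("kal", "കള"),
    ("am", "ം"),
    ("i", "ി"),
]


def stem_in_lockstep(roman, native, pairs=SUFFIX_PAIRS):
    """Strip matching suffixes from both spellings until no pair applies.

    Returns the list of intermediate (roman, native) forms; the last entry
    is the base form that would be learned.
    """
    steps = []
    stripped = True
    while stripped:
        stripped = False
        for roman_suffix, native_suffix in pairs:
            # A pair applies only if BOTH spellings end with their suffix,
            # and something is left in front of it.
            if (len(roman) > len(roman_suffix)
                    and len(native) > len(native_suffix)
                    and roman.endswith(roman_suffix)
                    and native.endswith(native_suffix)):
                roman = roman[:-len(roman_suffix)]
                native = native[:-len(native_suffix)]
                steps.append((roman, native))
                stripped = True
                break
    return steps


for roman, native in stem_in_lockstep("thozhilalikalude", "തൊഴിലാളികളുടെ"):
    print(roman, "->", native)
```

Because a pair applies only when both spellings end with its suffix, 'kal' is never mapped to കൽ when the Malayalam word actually ends in കള, which is the check step b) in the proposal asks for.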
Stemming in Indian languages is really complex because of the way we
write words. So don't worry about getting 100% stemming. IMO, that is
impossible to achieve. Aim instead for a set of stemming rules that will
probably give you a success rate above 60-70%.

We should make these stemming rules configurable in the scheme file. So
in the malayalam scheme file, you define,

stem(a) = b

This gets compiled into the `vst` file, and at runtime `libvarnam`
will read the stemming rules from the `vst` file and apply them to the

As part of this, we also need to add a sort of conjunct rule to
`libvarnam` so that it knows how to combine a base form and a vowel. Don't
worry about this now. We will deal with it later.
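
To make the "rules live in the scheme file" idea concrete, here is a rough Python sketch that reads lines of the form `stem(a) = b` into a rule table and applies them. It assumes `a` is a suffix and `b` its replacement. The line syntax, the `load_stem_rules` and `apply_stem_rules` names, and the sample rules are all assumptions; the real scheme compiler and the `vst` format are not modelled here.

```python
import re

# Matches lines of the form: stem(<suffix>) = <replacement>
STEM_LINE = re.compile(
    r"^\s*stem\(\s*(?P<source>[^)]+?)\s*\)\s*=\s*(?P<target>\S+)\s*$"
)

# Hypothetical rule lines, as they might appear in a scheme file.
SCHEME_TEXT = """
# anything that is not a stem(...) line is ignored by this sketch
stem(ളുടെ) = ൾ
stem(റുടെ) = ർ
"""


def load_stem_rules(scheme_text):
    """Collect {suffix: replacement} from every stem(a) = b line."""
    rules = {}
    for line in scheme_text.splitlines():
        match = STEM_LINE.match(line)
        if match:
            rules[match.group("source")] = match.group("target")
    return rules


def apply_stem_rules(word, rules):
    """Rewrite the longest matching suffix; return the word unchanged otherwise."""
    for suffix in sorted(rules, key=len, reverse=True):
        if len(word) > len(suffix) and word.endswith(suffix):
            return word[:-len(suffix)] + rules[suffix]
    return word


rules = load_stem_rules(SCHEME_TEXT)
print(apply_stem_rules("അവളുടെ", rules))
```

Keeping the rules as data means a new language only needs a new set of `stem(...)` lines, while the code that evaluates them stays the same.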
> Kevin Martin Jose
> On Fri, Feb 28, 2014 at 7:50 PM, Navaneeth K N <address@hidden> wrote:
> Hello Kevin,
> On 2/28/14 12:43 PM, Kevin Martin wrote:
>>>> I'm seeking to improve varnam's learning capabilities as a GSoC project.
>>>> I've gone through the source code and I have doubts. I need to clarify if
>>>> my line of thinking is right. Please have a look :
>>>> 1) Token : A token is an indivisible word. A token is the basic building
>>>> block. 'tokens' is an object (instance? I mean the non-OOP equivalent of
>>>> object) of the type varray. 'tokens' contain all the possible patterns
>>>> of a token? For example, മലയാളം മലയാളത്തിന്റെ മലയാളത്തിൽ മലയാള would all go
>>>> under the same varray instance 'tokens'?. And each word ( for eg മലയാളം )
>>>> would occupy a slot at tokens->memory I suppose. Am I right in this
> In മലയാളം, മ will be a token. `varray` is a generic data structure that
> can hold any elements and grows its storage as required. So
> `tokens->memory` will have the following tokens: മ, ല, യാ, ളം. Each
> token knows about a pattern and a value.
> Look at the scheme file in the "schemes/" directory. A token is a
> pattern-value mapping.
>>>> 2) I see the data type 'v_' frequently used. However,I could not find its
>>>> definition! I missed it, of course. Running ctrl+f on a few source files
>>>> did not turn up the definitions. So I thought I would simply ask here! I
>>>> would be really grateful if you can tell me where it is defined and why it
>>>> is defined (what it does)
> That's a dirty hack. It's a define, done at. It gets replaced with
> `handle->internal` by the preprocessor. It is just a shorthand for
> `handle->internal`. Not elegant, but I got used to it. We will clean it up
> one day. Sorry for the confusion.
>>>> 3) I read the porter stemmer algorithm. The ideas page say *"something like
>>>> a porter stemmer implementation but integrated into the varnam framework so
>>>> that new language support can be added easily"*. I really doubt if
>>>> implementing a porter stemmer would make adding new language support any
>>>> easier. The English stemmer is an improvised version of the original
>>>> stemmer. A stemming algorithm is specific to a particular language since it
>>>> deals with the suffixes that occur in that language. We need a malayalam
>>>> stemmer, and if we want to add support to say telugu one day, we would need
>>>> a telugu stemmer. We can of course write one stemmer and add test cases and
>>>> suffix condition checks in the new language so that tokenization can be
>>>> done with the same function call.
> When I said integrated into the framework, I meant making the stemmer
> configurable at the scheme file level. Basically, the scheme file will have
> a way to define the stemming. Now when a new language is added, there
> will be a new scheme file, and the stemming rules for that language go
> into the appropriate scheme file. All varnam needs to know is how to
> properly evaluate those rules.
> I am in the process of writing some documentation explaining the scheme
> file and vst files. I will send it to you once it is done. It will make this
> much easier to understand.
>>>> 4) The ideas page say "Today, when a word is learned, varnam takes all
>>>> possible prefixes into account". Prefixes? Shouldn't it be suffixes?
> No, it is prefixes. For example, when the word മലയാളം is learned, varnam
> learns the prefixes മല, മലയാ etc. So when it gets a pattern like
> "malayali", it can easily tokenize it, rather than you having to type
> something like "malayaali".
> Suffixes won't help because tokenization is left to right. This is where
> another major improvement could be possible in varnam. If we can come up
> with a tokenization algorithm which takes prefixes, suffixes and partial
> matches into account, then we can literally transliterate any word. But
> it's a hard problem which needs lots of research and effort. The hard part
> will be doing it at the scale at which varnam is operating now. Today,
> every keystroke that you make in the varnam editor searches over 7
> million patterns to predict the result. All this happens in less than a
> second. Improving tokenization while keeping the current performance is a
> *hard* problem.
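
The prefix learning described above can be pictured with a small Python sketch. It assumes the learned word is already split into (pattern, value) tokens and that prefixes are cut at token boundaries. Both function names are hypothetical, and starting at two tokens is a choice made for this example, not something the thread states.

```python
def learn_prefixes(tokens):
    """Return {pattern_prefix: value_prefix} for every run of 2+ leading tokens."""
    learned = {}
    pattern, value = "", ""
    for count, (token_pattern, token_value) in enumerate(tokens, start=1):
        pattern += token_pattern
        value += token_value
        if count >= 2:
            learned[pattern] = value
    return learned


def longest_learned_prefix(text, learned):
    """Find the longest learned pattern that text starts with, if any."""
    matches = [p for p in learned if text.startswith(p)]
    return max(matches, key=len) if matches else None


# Tokens for the learned word, as listed earlier in the thread.
tokens = [("ma", "മ"), ("la", "ല"), ("ya", "യാ"), ("lam", "ളം")]
learned = learn_prefixes(tokens)
prefix = longest_learned_prefix("malayali", learned)
print(prefix, "->", learned[prefix])
```

Here `malayali` reuses the learned prefix `malaya`, so only the remaining `li` still has to be tokenized from scratch. A suffix table would not help in the same way because matching starts from the left.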
>>>> Let me try and coin a malayalam stemmer. I will post what I come up with
> That's great. Feel free to ask any questions. You are already asking
> pretty good questions. Good going.
>>>> Kevin Martin Jose