Re: [Varnamproject-discuss] Improving Varnam Learning

On Tue, Mar 4, 2014 at 10:12 AM, Navaneeth K N <address@hidden> wrote:

Hello Kevin,

Good to see that you are making progress.

On 3/3/14 12:58 PM, Kevin Martin wrote:
>> No it is prefixes. For example, when the word മലയാളം is learned, varnam
>> learns the prefixes, മല, മലയാ etc. So when it gets a pattern like
>> "malayali", it can easily tokenize it rather than typing like "malayaali".
>
> 1. What do you mean by tokenization? A token is a pattern to symbol mapping.
> So tokenization means matching the entire word to its malayalam symbol?

Tokenization is splitting the input into multiple tokens. For example:

input - malayalam
tokens - [[ma], [la], [ya], [lam]]

Each will be a `vtoken` instance with relevant attributes set. For the
token `ma`, it will be marked as a consonant.

Tokenization happens left to right. It is a greedy tokenizer which finds
the longest possible match at each position. Look at the `vst_tokenize`
function to learn how it works.
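
To make the idea concrete, here is a minimal sketch of greedy
longest-match tokenization in Python. It is not the actual C code in
`vst_tokenize`; the pattern table, the function name `tokenize` and the
fallback for unknown characters are all assumptions made for
illustration. Real patterns come from the scheme file.

```python
# Hypothetical pattern -> value table. Real mappings live in the scheme file.
PATTERNS = {
    "ma": "മ",
    "la": "ല",
    "ya": "യാ",
    "lam": "ളം",
}


def tokenize(text, patterns):
    """Split text left to right, always taking the longest known pattern."""
    max_len = max(len(p) for p in patterns)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in patterns:
                tokens.append((chunk, patterns[chunk]))
                i += size
                break
        else:
            # Assumption: an unknown character passes through unchanged.
            tokens.append((text[i], text[i]))
            i += 1
    return tokens


print([pattern for pattern, _ in tokenize("malayalam", PATTERNS)])
```

At the last position both `la` and `lam` match, and the greedy rule
picks `lam` because it is longer.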

>
> 2. The porter stemmer stems the given English word to a base word by
> stripping it off all the suffixes. How can we stem a malayalam word?
> Suppose that varnam is encountering the word മലയാളം for the first time. The
> input was 'malayalam'. In this case, as of now, varnam learns to map 'mala'
> to മല, 'malaya' to മലയാ and so on? Hence learning a word makes varnam learn
> the mappings for all its prefixes, right?

Stemming a Malayalam word would give something like the following:

stem(അവളുടെ) = അവൾ
stem(കാറ്റിന്റെ) = കാറ്റ്
stem(ഡോ‍ക്ക്റ്ററുടെ) = ഡോ‍ക്ക്റ്റർ

>
> 3. Let me propose a stemmer that rips off suffixes. Consider the word
> മലയാളം (malayalam) that was learned by varnam.
> I think the goal of the stemmer should be to get the base word മലയാള
> (malayal) rather than മലയൽ. To do this, I think we will need to compare the
> obtained base word with the original word. Let us assume that the stemming
> algorithm got the base word 'malayal' from 'malayalam'. We can make sure
> that this is mapped to മലയാള rather than മലയൽ by ripping off the equivalent
> suffix from the malayalam transliteration word. That is,
>
> removing the suffix 'am' from 'malayalam' removes the ം from 'മലയാളം'. For
> this, 'am' should have been matched with ം in the scheme file. Hence
> we would get മലയാള for 'malayal' and this can be learned. This would result
> in the easier mapping of malayali to മലയാളി .
>
> Another example :
>
> thozhilalikalude is തൊഴിലാളികളുടെ
>
> a) Sending 'thozhilalikalude' to the stemmer, we obtain 'thozhilalikal' in
> step 1. As a corresponding step ു ടെ is removed from തൊഴിലാളികളുടെ and
> results in തൊഴിലാളികള. No learning occurs in this step because we have not
> reached the base word yet.
> b) 'thozhilalikal' is stemmed to 'thozhilali' - കള is removed from
> തൊഴിലാളികള. Even though 'kal', the suffix that was removed, could be
> matched to കൽ, we do not do that because the word before stemming had ള.
> Produces തൊഴിലാളി .
> c) thozhilali is stemmed to thozhilal - Produces തൊഴിലാള from തൊഴിലാളി.
> This base word and the corresponding malayalam mapping is learned by varnam.
>
> I have not completed drafting the malayalam stemmer algorithm. It seems to
> have many more condition checks than I had anticipated and could end up
> being larger and more complicated than the porter stemmer. But before I
> proceed, I need to know whether the logic I presented above is correct.

You are heading in the right direction.
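
The lockstep suffix stripping described in the quoted proposal could be
sketched roughly like this in Python. This is only an illustration: the
paired rule table and the function name `strip_in_lockstep` are
hypothetical, and the suffix pairs are the ones from the examples in
this thread, not a real rule set.

```python
# Hypothetical paired rules: (Latin suffix, Malayalam suffix).
PAIRED_SUFFIXES = [
    ("ude", "ുടെ"),
    ("kal", "കള"),
    ("am", "ം"),
    ("i", "ി"),
]


def strip_in_lockstep(pattern, word, rules=PAIRED_SUFFIXES):
    """Strip matching suffix pairs from both strings until no rule applies.

    Returns every (pattern, word) pair seen, starting with the input.
    """
    steps = [(pattern, word)]
    changed = True
    while changed:
        changed = False
        for latin, native in rules:
            # A rule fires only if BOTH sides end with their half of the pair,
            # and something is left over after stripping.
            if (pattern.endswith(latin) and word.endswith(native)
                    and len(pattern) > len(latin) and len(word) > len(native)):
                pattern = pattern[:-len(latin)]
                word = word[:-len(native)]
                steps.append((pattern, word))
                changed = True
                break
    return steps


for latin, native in strip_in_lockstep("thozhilalikalude", "തൊഴിലാളികളുടെ"):
    print(latin, native)
```

Only the last pair in the returned list would be learned, since the
intermediate steps have not reached the base word yet.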

Stemming in Indian languages is really complex because of the way we
write words. So don't worry about getting 100% correct stemming. IMO,
that is impossible to achieve. Instead, target a set of stemming rules
that will give you a success rate of more than 60-70%.

We should make these stemming rules configurable in the scheme file. So
in the malayalam scheme file, you define,

stem(a) = b

This gets compiled into the `vst` file, and at runtime `libvarnam`
will read the stemming rules from the `vst` file and apply them to the
target word.
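
As a rough sketch of what applying such rules could look like, here is
a small Python example. It assumes each `stem(a) = b` rule means "if the
word ends with a, replace that suffix with b", which is only one
possible reading. The rule list and the function name `stem` are
hypothetical, and this is not `libvarnam` code.

```python
# Hypothetical rules, as if loaded from a compiled scheme: (suffix, replacement).
STEM_RULES = [
    ("ളുടെ", "ൾ"),
    ("ിന്റെ", "്"),
]


def stem(word, rules):
    """Apply the first matching rule, trying longer suffixes first."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        # Never strip the whole word; something must remain as the base.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)] + replacement
    # No rule matched, so the word is returned as it is.
    return word


print(stem("അവളുടെ", STEM_RULES))
print(stem("കാറ്റിന്റെ", STEM_RULES))
```

Keeping the rules as plain data is what makes them language specific
without changing the code that evaluates them.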

As part of this, we also need to implement a sort of conjunct rule in
`libvarnam` so that it knows how to combine a base form and a vowel.
Don't worry about this now. We will deal with it later.

>
> regards,
>
> Kevin Martin Jose
>
> On Fri, Feb 28, 2014 at 7:50 PM, Navaneeth K N <address@hidden> wrote:
>

> Hello Kevin,
>
> On 2/28/14 12:43 PM, Kevin Martin wrote:
>>>> I'm seeking to improve varnam's learning capabilities as a GSoC project.
>>>> I've gone through the source code and I have doubts. I need to clarify if
>>>> my line of thinking is right. Please have a look :
>>>>
>>>> 1) Token : A token is an indivisible word. A token is the basic building
>>>> block. 'tokens' is an object (instance? I mean the non-OOP equivalent of
>>>> an object) of the type varray. 'tokens' contain all the possible patterns
>>>> of a token? For example, മലയാളം മലയാളത്തിന്റെ മലയാളത്തിൽ മലയാള would all go
>>>> under the same varray instance 'tokens'?. And each word ( for eg മലയാളം )
>>>> would occupy a slot at tokens->memory I suppose. Am I right in this
>>>> regard?
>
> No.
>
> In മലയാളം, മ will be a token. `varray` is a generic data structure that
> can hold any elements and grow its storage as required. So
> `tokens->memory` will have the following tokens, മ, ല, യാ, ളം. Each
> token knows about a pattern and a value.
>
> Look at the scheme file in the "schemes/" directory. A token is a
> pattern-value mapping.
>
>
>>>>
>>>> 2) I see the data type 'v_' frequently used. However, I could not find its
>>>> definition! I missed it, of course. Running ctrl+f on a few source files
>>>> did not turn up the definitions. So I thought I would simply ask here! I
>>>> would be really grateful if you can tell me where it is defined and why
>>>> it is defined (what it does)
>
> That's a dirty hack. It's a `#define`, done at [1]. The preprocessor
> replaces it with `handle->internal`, so it is just a shorthand for
> `handle->internal`. Not elegant, but I got used to it. We will clean it
> up one day. Sorry for the confusion.
>
> [1]:
>
> https://gitorious.org/varnamproject/libvarnam/source/68a17b6e2e5d114d6a606a9a47294917655a167f:util.h#L26
>
>>>>
>>>> 3) I read the porter stemmer algorithm. The ideas page say *"something
>>>> like a porter stemmer implementation but integrated into the varnam
>>>> framework so that new language support can be added easily"*. I really
>>>> doubt if implementing a porter stemmer would make adding new language
>>>> support any easier. The English stemmer is an improvised version of the
>>>> original porter stemmer. A stemming algorithm is specific to a particular
>>>> language since it deals with the suffixes that occur in that language. We
>>>> need a malayalam stemmer, and if we want to add support to say telugu one
>>>> day, we would need a telugu stemmer. We can of course write one stemmer
>>>> and add test cases and suffix condition checks in the new language so
>>>> that tokenization can be done with the same function call.
>
> When I said integrated into the framework, I meant making the stemmer
> configurable at the scheme file level. Basically the scheme file will have
> a way to define the stemming. Now when a new language is added, there
> will be a new scheme file, and the stemming rules for that language go
> into that scheme file. All varnam needs to know is how to properly
> evaluate those rules.
>
> I am in the process of writing some documentation explaining the scheme
> file and vst files. I will send it to you once it is done. It will make
> this much easier to understand.
>
>>>>
>>>>
>>>> 4) The ideas page say "Today, when a word is learned, varnam takes all
>>>> the possible prefixes into account". Prefixes? Shouldn't it be suffixes?
>
> No it is prefixes. For example, when the word മലയാളം is learned, varnam
> learns the prefixes, മല, മലയാ etc. So when it gets a pattern like
> "malayali", it can easily tokenize it rather than typing like "malayaali".
>
> Suffixes won't help because tokenization is left to right. This is where
> another major improvement could be possible in varnam. If we can come up
> with a tokenization algorithm which takes prefixes, suffixes and partial
> matches into account, then we can transliterate literally any word. But
> it's a hard problem which needs lots of research and effort. The challenge
> will be doing it at the scale at which varnam is operating now. Today,
> every keystroke that you make in the varnam editor searches over 7
> million patterns to predict the result. All this happens in less than a
> second. Improving tokenization while keeping the current performance is a
> *hard* problem.
>
>>>>
>>>> Let me try and coin a malayalam stemmer. I will post what I come up with
>>>> here.
>
> That's great. Feel free to ask any questions. You are already asking
> pretty good questions. Good going.
>
>>>>
>>>> regards,
>>>>
>>>> Kevin Martin Jose
>>>>
>
>>
>>
>

--
Cheers,
Navaneeth

From:	Kevin Martin
Subject:	Re: [Varnamproject-discuss] Improving Varnam Learning
Date:	Tue, 4 Mar 2014 21:56:25 +0530