
Re: [Varnamproject-discuss] Improving Varnam Learning


From: Navaneeth K N
Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
Date: Mon, 10 Mar 2014 20:02:03 +0530
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:24.0) Gecko/20100101 Thunderbird/24.3.0


Hello Kevin,

On 3/10/14 7:46 PM, Kevin Martin wrote:
> Thanks a lot. One final question before I start working on my proposal. The
> updated idea includes implementing concurrent learning. I think I will have
> to include that too in my proposal. Can you give me some more details? This
> is my guess:
> 
> Varnam learns one word at a time. So when a word is supplied to
> varnam_learn, it learns and stores it in SQLite - one at a time. To learn a
> new word, varnam_learn has to be called again. What we need is a queue,
> passed to varnam_learn as an argument, and a loop that runs over this queue,
> learning and adding the words in the queue one by one to the SQLite db. Am I
> right?

The problem is on the SQLite side. Since SQLite is a single-file database,
it locks the whole file while a write transaction is in progress,
disallowing any other writes. This affects the way learning works,
essentially limiting it to one learn at a time.

The problem shows up when someone installs a client, let's say the IBus
engine. After the installation, the whole word corpus is fed into varnam.
This is a huge dataset and may take 10-20 minutes to learn fully. Any call
to `varnam_learn()` from the IBus engine during this time will fail with an
error because it can't get a lock on the database, which is held by the
other process feeding the initial training set.

To circumvent this, `varnam_learn()` should detect the busy condition and
queue the word somewhere - for example by writing it to a plain-text file
in a temporary location, or to another DB file. When the current
transaction finishes, the queue is checked and all the queued words are
learned. This way concurrent `varnam_learn()` calls never error out.

This is not going to give any speed improvement because learning still
happens sequentially. But it will improve the API usage experience, because
users don't have to retry a word when it fails.
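
Roughly, the queue-on-busy idea could look like the sketch below. This is
illustrative only, not libvarnam code: it assumes a made-up `words(word TEXT)`
table and a plain-text queue file, and talks to SQLite directly instead of
going through the real `varnam_learn()` internals.

    #include <sqlite3.h>
    #include <stdio.h>
    #include <string.h>

    /* Try to learn one word; if the db is locked by another process
     * (e.g. the initial corpus import), park the word in a queue file. */
    static int learn_or_queue(sqlite3 *db, const char *word,
                              const char *queue_path)
    {
        sqlite3_stmt *stmt = NULL;
        int rc = sqlite3_prepare_v2(db,
                 "INSERT INTO words (word) VALUES (?1);", -1, &stmt, NULL);
        if (rc != SQLITE_OK)
            return rc;

        sqlite3_bind_text(stmt, 1, word, -1, SQLITE_TRANSIENT);
        rc = sqlite3_step(stmt);
        sqlite3_finalize(stmt);

        if (rc == SQLITE_BUSY || rc == SQLITE_LOCKED) {
            FILE *queue = fopen(queue_path, "a");
            if (queue == NULL)
                return SQLITE_CANTOPEN;
            fprintf(queue, "%s\n", word);
            fclose(queue);
            return SQLITE_OK;   /* reported as success; learned later */
        }
        return rc == SQLITE_DONE ? SQLITE_OK : rc;
    }

    /* Drain the queue once the long-running transaction has finished.
     * The file is renamed first so words queued while draining go to a
     * fresh queue file instead of the one being read. */
    static void drain_queue(sqlite3 *db, const char *queue_path)
    {
        char word[256], draining[512];
        FILE *queue;

        snprintf(draining, sizeof draining, "%s.draining", queue_path);
        if (rename(queue_path, draining) != 0)
            return;                       /* nothing queued */
        queue = fopen(draining, "r");
        if (queue == NULL)
            return;
        while (fgets(word, sizeof word, queue)) {
            word[strcspn(word, "\n")] = '\0';
            if (word[0] != '\0')
                learn_or_queue(db, word, queue_path);
        }
        fclose(queue);
        remove(draining);
    }

The prepare step can report the busy condition too, and `sqlite3_busy_timeout()`
is another knob, but the point is only that the caller never sees an error: the
word gets parked and picked up once the long transaction is done.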

> 
> 
> On Mon, Mar 10, 2014 at 10:27 AM, Navaneeth K N <address@hidden> wrote:
> 
> Hello,
> 
> On 3/9/14 4:59 PM, Kevin Martin wrote:
>>>> Perhaps I can help with in code documentation? In python we use
> docstrings
>>>> to state the purpose of each function. Is there a similar coding
> convention
>>>> in C? If so, I can write docstrings for learn.c and submit a PR. This
> will
>>>> also help me get more familiar with the code.
> 
> I think there is doxygen which can generate documentation from the
> source code. I haven't tried it myself but should work out of the box.
> 
> I think you have done enough preliminary work to submit a proposal. You can
> probably work on your proposal and fix all these once you start the GSoC
> coding. We can also plan to improve the learning further by improving the
> tokenization. I will explain this in a new email.
> 
>>>>
>>>>
>>>> On Sun, Mar 9, 2014 at 4:54 PM, Kevin Martin <
> address@hidden>wrote:
>>>>
>>>>> I went through the bugs tracker. The issue with tokenization was fixed
> but
>>>>> the tracker was not closed yet. Are there any simple bugs I can fix
> before
>>>>> the GSoC application window closes?
>>>>>
>>>>>
>>>>> On Sun, Mar 9, 2014 at 1:56 PM, Navaneeth K N <address@hidden> wrote:
>>>>>
>>>> Hello Kevin,
>>>>
>>>> Thanks for the stemming rules. I didn't get time to review it
>>>> completely. But looks good so far.
>>>>
>>>> On 3/9/14 12:10 PM, Kevin Martin wrote:
>>>>>>>> A doubt regarding the vst file. It is an sqlite3 database file
> right? I
>>>>>>>> could not open it with 'sqlite3 ml.vst'. And the scheme file will be
>>>>>>>> compiled into a .vst file.
>>>>
>>>> yes. It is a SQLite file. You should be able to open it with the sqlite
>>>> utility. A scheme file can be compiled into VST file using `varnamc`.
>>>>
>>>>         varnamc --compile schemes/your_scheme_file
>>>>
>>>> For the following points, I will send you a detailed email later today.
>>>>
>>>> I would like to try out my stemming rules (check
>>>>>>>> attachments). Here's how I assume I should proceed :
>>>>>>>>
>>>>>>>> 1. Write the stemming rules into the scheme file.
>>>>>>>>
>>>>>>>> 2. Compile the scheme file. For this, stem(pattern) should match a
>>>>>>>> corresponding function right? Where should I specify that function?
>>>> Which
>>>>>>>> file specifies how the scheme file should be compiled?
>>>>>>>>
>>>>>>>> 3. For testing, I'd like to input a word, have it transliterated
> (using
>>>>>>>> varnam_transliterate), and THEN stemmed. This stemmed word is
> displayed
>>>> and
>>>>>>>> then passed to varnam_train so as that particular pattern is always
>>>> matched
>>>>>>>> to that word. And what is the difference between varnam_learn and
>>>>>>>> varnam_train?
>>>>>>>>
>>>>>>>>
>>>>>>>> Also, I'm drafting a definitions_list - a file containing location of
>>>> the
>>>>>>>> definitions of a structure/function and what it does. It will not be
>>>> proper
>>>>>>>> documentation though. I have attached a sample with this mail. Its
>>>> really
>>>>>>>> helping me because after every 15 minutes I'll forget where a
> particular
>>>>>>>> structure/function was defined and then I'll start searching all the
>>>> source
>>>>>>>> codes. If you'd like to have this list I'll finish it soon and
> submit a
>>>> PR.
>>>>>>>>
>>>>>>>> Thank you for your time,
>>>>>>>>
>>>>>>>> Kevin Martin Jose
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Mar 8, 2014 at 4:57 PM, Kevin Martin <
>>>> address@hidden>wrote:
>>>>>>>>
>>>>>>>>> I have drafted a set of stemming rules. The file is attached with
> this
>>>>>>>>> post. Please go through it.
>>>>>>>>>
>>>>>>>>> You were right in that it is impossible to achieve 100% stemming. I
>>>> took a
>>>>>>>>> malayalam paragraph and tried stemming the words. The main problem
> is
>>>> that
>>>>>>>>> in malayalam many words are compounded together and thus is
> difficult
>>>> to
>>>>>>>>> segregate. Also, the stemming rules I have provided does not mention
>>>> any
>>>>>>>>> specific order. Those rules will have to be applied in a specific
>>>> order to
>>>>>>>>> stem a given word. The English stemmer could do it without
> recursion,
>>>> and I
>>>>>>>>> think the malayalam stemmer could too - with the right ordering.
>>>>>>>>>
>>>>>>>>> There's a number assigned to each rule - the line number. So rule 3
>>>> refers
>>>>>>>>> to the statement written in line 3. I have tried to provide examples
>>>> where
>>>>>>>>> ever it seemed necessary.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 7, 2014 at 10:48 PM, Kevin Martin Jose <
>>>>>>>>> address@hidden> wrote:
>>>>>>>>>
>>>>>>>>>>   Thanks a lot.
>>>>>>>>>>  ------------------------------
>>>>>>>>>> From: Navaneeth K N <address@hidden>
>>>>>>>>>> Sent: ?07-?03-?2014 09:56
>>>>>>>>>> To: address@hidden
>>>>>>>>>> Subject: Re: [Varnamproject-discuss] Improving Varnam Learning
>>>>>>>>>>
>>>>>>>> Hello Kevin,
>>>>>>>>
>>>>>>>> On 3/5/14 12:12 AM, Kevin Martin wrote:
>>>>>>>>>>>> I went through the vst_tokenize() function. To my disappointment,
>>>>>>>>>>>> understanding it was not as easy as I thought. I wrestled for a
> few
>>>>>>>> hours
>>>>>>>>>>>> with code and decided that I need to assimilate a few key
> concepts
>>>>>>>> before I
>>>>>>>>>>>> can understand what vst_tokenize does.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. What is a vpool? Why is it needed? I read its definition but I
>>>> do not
>>>>>>>>>>>> understand its purpose or how it is used. Is it a pool of free
>>>> varrays?
>>>>>>>>>>>>    To be more specific, I would like to know the purpose of
> elements
>>>>>>>> like
>>>>>>>>>>>> v_->strings_pool. What does the function get_pooled_string()
>>>>>>>>>>>>    return?
>>>>>>>>
>>>>>>>> It is object pooling, a technique to reuse already allocated objects
>>>>>>>> rather than reallocating them over and over. This improves performance
>>>>>>>> over time because the pool is destroyed only when the handle is
>>>>>>>> destroyed, and mostly you will have the handle available throughout
>>>>>>>> the application.
>>>>>>>>
>>>>>>>> get_pooled_string() returns a strbuf, a dynamically growing string type.
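>>>>>>>>
>>>>>>>> A stripped-down version of the idea (illustrative only, not the real
>>>>>>>> strbuf/pool code; error handling omitted):
>>>>>>>>
>>>>>>>>     #include <stdlib.h>
>>>>>>>>
>>>>>>>>     struct buf  { char *mem; size_t cap; };
>>>>>>>>     struct pool { struct buf **items; size_t count; size_t next; };
>>>>>>>>
>>>>>>>>     /* Reuse a pooled buffer; allocate only when none is free. */
>>>>>>>>     static struct buf *pool_get(struct pool *p)
>>>>>>>>     {
>>>>>>>>         if (p->next < p->count)
>>>>>>>>             return p->items[p->next++];
>>>>>>>>         p->items = realloc(p->items, (p->count + 1) * sizeof *p->items);
>>>>>>>>         p->items[p->count] = calloc(1, sizeof **p->items);
>>>>>>>>         p->next = ++p->count;
>>>>>>>>         return p->items[p->count - 1];
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     /* Called once per tokenize/transliterate call; every buffer becomes
>>>>>>>>      * reusable without freeing its memory. */
>>>>>>>>     static void pool_reset(struct pool *p) { p->next = 0; }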
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2.What is a vcache_entry? What is the purpose of the strbuf
>>>> 'cache_key'
>>>>>>>> in
>>>>>>>>>>>> vst_tokenize()? What are the contents of a vcache?
>>>>>>>>
>>>>>>>> Vcache is a hashtable. This is another optimization technique, to reuse
>>>>>>>> already tokenized words. For example, when "malayalam" is transliterated,
>>>>>>>> tokenization happens and the cache gets filled with tokens. When it is
>>>>>>>> transliterated again, tokenization will just use the cache and won't
>>>>>>>> touch the disk. This improves performance dramatically.
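>>>>>>>>
>>>>>>>> The lookup-before-disk flow is roughly the following toy (the real cache
>>>>>>>> is a hashtable kept inside the varnam handle, not a fixed global table):
>>>>>>>>
>>>>>>>>     #include <string.h>
>>>>>>>>
>>>>>>>>     #define SLOTS 1024
>>>>>>>>     struct entry { char key[64]; void *tokens; };
>>>>>>>>     static struct entry cache[SLOTS];
>>>>>>>>
>>>>>>>>     static unsigned slot(const char *s)
>>>>>>>>     {
>>>>>>>>         unsigned h = 5381;
>>>>>>>>         while (*s) h = h * 33 + (unsigned char)*s++;
>>>>>>>>         return h % SLOTS;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     /* NULL means not cached: tokenize from the vst, then cache_put(). */
>>>>>>>>     static void *cache_get(const char *word)
>>>>>>>>     {
>>>>>>>>         struct entry *e = &cache[slot(word)];
>>>>>>>>         return strcmp(e->key, word) == 0 ? e->tokens : NULL;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     static void cache_put(const char *word, void *tokens)
>>>>>>>>     {
>>>>>>>>         struct entry *e = &cache[slot(word)];
>>>>>>>>         strncpy(e->key, word, sizeof e->key - 1);
>>>>>>>>         e->key[sizeof e->key - 1] = '\0';
>>>>>>>>         e->tokens = tokens;
>>>>>>>>     }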
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 3. What is the purpose of int tokenize_using and int match_type,
> the
>>>>>>>>>>>> parameters of vst_tokenize()?
>>>>>>>>
>>>>>>>> A tokenization can be of two types: pattern tokenization and value
>>>>>>>> tokenization. Pattern tokenization is about tokenizing the words you
>>>>>>>> send for transliteration. Value tokenization works on the indic text.
>>>>>>>>
>>>>>>>> tokenize ("malayalam") = pattern tokenization
>>>>>>>> tokenize ("??????") = value tokenization
>>>>>>>>
>>>>>>>> To understand how tokenization works, you can use the `print-tokens`
>>>>>>>> tool available in the `tools` directory. It is not compiled by default;
>>>>>>>> you need to pass `-DBUILD_TOOLS=true` when doing `cmake .` to get it
>>>>>>>> compiled.
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 4. Assume that a malayalam stemmer ml_stemmer() has been
>>>> implemented.
>>>>>>>> Will
>>>>>>>>>>>> it replace vst_tokenize() or will the line :
>>>>>>>>>>>>
>>>>>>>>>>>>                           base=ml_stemmer(input)
>>>>>>>>>>>>
>>>>>>>>>>>>     be inside the vst_tokenize() function? The answer to this
>>>> question
>>>>>>>> must
>>>>>>>>>>>> be pretty straight forward but I cannot see it since I do not
>>>>>>>>>>>>     understand vst_tokenize() yet.
>>>>>>>>
>>>>>>>> The stemmer won't have any connection to tokenization. It will be part
>>>>>>>> of the learning subsystem, so the `varnam_learn()` function will use it.
>>>>>>>>
>>>>>>>> Also, the stemmer has to be configurable for each language. You need to
>>>>>>>> add a new function to the scheme file compiler so that you can do
>>>>>>>> something like the following in each scheme file:
>>>>>>>>
>>>>>>>> stem ("????", "?")
>>>>>>>>
>>>>>>>> This rule needs to be compiled into the `vst` file, and during learn it
>>>>>>>> should be used to do the stemming.
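>>>>>>>>
>>>>>>>> Applying the compiled rules at learn time could look roughly like this
>>>>>>>> (a sketch only; loading the (suffix, replacement) pairs from the vst and
>>>>>>>> the real rule format are not shown, and since everything is UTF-8 a
>>>>>>>> plain byte-wise suffix comparison is enough):
>>>>>>>>
>>>>>>>>     #include <stdio.h>
>>>>>>>>     #include <string.h>
>>>>>>>>
>>>>>>>>     struct stem_rule { const char *suffix; const char *replacement; };
>>>>>>>>
>>>>>>>>     /* Writes the stemmed word into `out`, returns 1 when a rule matched. */
>>>>>>>>     static int apply_stem_rules(const char *word,
>>>>>>>>                                 const struct stem_rule *rules, size_t n,
>>>>>>>>                                 char *out, size_t outlen)
>>>>>>>>     {
>>>>>>>>         size_t wlen = strlen(word);
>>>>>>>>         for (size_t i = 0; i < n; ++i) {
>>>>>>>>             size_t slen = strlen(rules[i].suffix);
>>>>>>>>             if (slen <= wlen &&
>>>>>>>>                 strcmp(word + wlen - slen, rules[i].suffix) == 0) {
>>>>>>>>                 snprintf(out, outlen, "%.*s%s",
>>>>>>>>                          (int)(wlen - slen), word, rules[i].replacement);
>>>>>>>>                 return 1;
>>>>>>>>             }
>>>>>>>>         }
>>>>>>>>         return 0;   /* no rule matched; learn the word as-is */
>>>>>>>>     }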
>>>>>>>>
>>>>>>>> We may also need to fix how varnam combines two tokens. Currently, when
>>>>>>>> a consonant and a vowel come together, varnam will render the
>>>>>>>> consonant-vowel form. But this is very basic and won't work in some
>>>>>>>> conditions where chillu letters are involved. I will think about this
>>>>>>>> and draft the idea.
>>>>>>>>
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 4, 2014 at 9:56 PM, Kevin Martin <
>>>>>>>> address@hidden>wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you. I have a much better idea now. Another clarification
>>>> needed
>>>>>>>> :
>>>>>>>>>>>>>
>>>>>>>>>>>>> stem(????????????? )= ???????? or ????????
>>>>>>>>>>>>>
>>>>>>>>>>>>> Even though stemming it to ???????? makes more sense in
> malayalam,
>>>> it
>>>>>>>>>>>>> would be clearer to stem 'thozhilalikalude' to 'thozhilal'
>>>> (without the
>>>>>>>>>>>>> trailing 'i') in English. Hence IMO ??????? would be a better
> base
>>>> word
>>>>>>>>>>>>> than ????????. But the examples you provided in the previous
> mail
>>>>>>>> [given
>>>>>>>>>>>>> below] would hold.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [Examples from previous mail]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> stem(??????) = ???
>>>>>>>>>>>>>> stem(??????????) = ??????
>>>>>>>>>>>>>> stem(??????????????) = ???????????
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 4, 2014 at 10:12 AM, Navaneeth K N <address@hidden>
>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>> Hello Kevin,
>>>>>>>>>>>>
>>>>>>>>>>>> Good to see that you are making progress.
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/3/14 12:58 PM, Kevin Martin wrote:
>>>>>>>>>>>>>>>>> No it is prefixes. For example, when the word ?????? is
>>>> learned,
>>>>>>>> varnam
>>>>>>>>>>>>>>>>> learns the prefixes, ??, ???? etc. So when it gets a pattern
>>>> like
>>>>>>>>>>>>>>>>> "malayali", it can easily tokenize it rather than typing
> like
>>>>>>>>>>>> "malayaali".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1.What do you mean by tokenization? A token is a pattern to
>>>> symbol
>>>>>>>>>>>> mapping.
>>>>>>>>>>>>>>>> So tokenization means matching the entire word to its
> malayalam
>>>>>>>> symbol?
>>>>>>>>>>>>
>>>>>>>>>>>> Tokenization is splitting the input into multiple tokens. For example:
>>>>>>>>>>>>
>>>>>>>>>>>> input - malayalam
>>>>>>>>>>>> tokens - [[ma], [la], [ya], [lam]]
>>>>>>>>>>>>
>>>>>>>>>>>> Each will be a `vtoken` instance with the relevant attributes set. For
>>>>>>>>>>>> the token `ma`, it will be marked as a consonant.
>>>>>>>>>>>>
>>>>>>>>>>>> Tokenization happens left to right. It is a greedy tokenizer which
>>>>>>>>>>>> finds the longest possible match. Look at the `vst_tokenize` function
>>>>>>>>>>>> to learn how it works.
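>>>>>>>>>>>>
>>>>>>>>>>>> The greedy matching is roughly this toy (hard-coded patterns instead
>>>>>>>>>>>> of the vst/cache lookup that vst_tokenize really does):
>>>>>>>>>>>>
>>>>>>>>>>>>     #include <stdio.h>
>>>>>>>>>>>>     #include <string.h>
>>>>>>>>>>>>
>>>>>>>>>>>>     /* At each position take the longest known pattern, so
>>>>>>>>>>>>      * "malayalam" prints [ma][la][ya][lam]. */
>>>>>>>>>>>>     static const char *patterns[] = { "lam", "ma", "la", "ya", "a", "l", "m", "y" };
>>>>>>>>>>>>
>>>>>>>>>>>>     static void tokenize(const char *input)
>>>>>>>>>>>>     {
>>>>>>>>>>>>         size_t pos = 0, len = strlen(input);
>>>>>>>>>>>>         while (pos < len) {
>>>>>>>>>>>>             size_t best = 0;
>>>>>>>>>>>>             const char *match = NULL;
>>>>>>>>>>>>             for (size_t i = 0; i < sizeof patterns / sizeof *patterns; ++i) {
>>>>>>>>>>>>                 size_t plen = strlen(patterns[i]);
>>>>>>>>>>>>                 if (plen > best && strncmp(input + pos, patterns[i], plen) == 0) {
>>>>>>>>>>>>                     best = plen;
>>>>>>>>>>>>                     match = patterns[i];
>>>>>>>>>>>>                 }
>>>>>>>>>>>>             }
>>>>>>>>>>>>             if (!match) { pos++; continue; }   /* unknown input, skip it */
>>>>>>>>>>>>             printf("[%s]", match);
>>>>>>>>>>>>             pos += best;
>>>>>>>>>>>>         }
>>>>>>>>>>>>         printf("\n");
>>>>>>>>>>>>     }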
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. The porter stemmer stems the given English word to a base
>>>> word by
>>>>>>>>>>>>>>>> stripping it off all the suffixes. How can we stem a
> malayalam
>>>> word?
>>>>>>>>>>>>>>>> Suppose that varnam is encountering the word ?????? for the
>>>> first
>>>>>>>> time.
>>>>>>>>>>>> The
>>>>>>>>>>>>>>>> input was 'malayalam'. In this case, as of now, varnam learns
>>>> to map
>>>>>>>>>>>> 'mala'
>>>>>>>>>>>>>>>> to ??, 'malaya' to ???? and so on? Hence learning a word
> makes
>>>>>>>> varnam
>>>>>>>>>>>> learn
>>>>>>>>>>>>>>>> the mappings for all its prefixes, right?
>>>>>>>>>>>>
>>>>>>>>>>>> Something like the following:
>>>>>>>>>>>>
>>>>>>>>>>>> stem(??????) = ???
>>>>>>>>>>>> stem(??????????) = ??????
>>>>>>>>>>>> stem(??????????????) = ???????????
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 3. Let me propose a stemmer that rips off suffixes. Consider
> the
>>>>>>>> word
>>>>>>>>>>>>>>>> ?????? (malayalam) that was learned by varnam.
>>>>>>>>>>>>>>>> I think the goal of the stemmer should be to get the base
> word
>>>> ?????
>>>>>>>>>>>>>>>> (malayal) rather than ????. To do this, I think we will need
> to
>>>>>>>> compare
>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> obtained base word with the original word. Let us assume that
>>>> the
>>>>>>>>>>>> stemming
>>>>>>>>>>>>>>>> algorithm got the base word 'malayal' from 'malayalam'. We
> can
>>>> make
>>>>>>>> sure
>>>>>>>>>>>>>>>> that this is mapped to ????? rather than ???? by ripping off
> the
>>>>>>>>>>>> equivalent
>>>>>>>>>>>>>>>> suffix from the malayalam transliteration word. That is,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> removing the suffix 'am' from 'malayalam' removes the ? from
>>>>>>>> '??????'.
>>>>>>>>>>>> For
>>>>>>>>>>>>>>>> this, 'am' needs should have been matched with ? in the
> scheme
>>>> file.
>>>>>>>>>>>> Hence
>>>>>>>>>>>>>>>> we would get ????? for 'malayal' and this can be learned.
> This
>>>> would
>>>>>>>>>>>> result
>>>>>>>>>>>>>>>> in the easier mapping of malayali to ?????? .
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another example :
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> thozhilalikalude is ?????????????
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> a).sending 'thozhilalikalude' to the stemmer, we obtain
>>>>>>>> 'thozhilalikal'
>>>>>>>>>>>> in
>>>>>>>>>>>>>>>> step 1. As a corresponding step  ? ?? is removed from
>>>> ?????????????
>>>>>>>> and
>>>>>>>>>>>>>>>> results in ??????????. No learning occurs in this step
> because
>>>> we
>>>>>>>> have
>>>>>>>>>>>> not
>>>>>>>>>>>>>>>> reached the base word yet.
>>>>>>>>>>>>>>>> b) 'thozhilalikal' is stemmed to 'thozhilali' - ?? is removed
>>>> from
>>>>>>>>>>>>>>>> ??????????. Even though 'kal', the suffix that was removed,
>>>> could be
>>>>>>>>>>>>>>>> matched to ??, we do not do that because the word before
>>>> stemming
>>>>>>>> had
>>>>>>>>>>>>  ?.
>>>>>>>>>>>>>>>> Produces ???????? .
>>>>>>>>>>>>>>>> c) thozhilali is stemmed to thozhilal - Produces ??????? from
>>>>>>>> ????????.
>>>>>>>>>>>>>>>> This base word and the corresponding malayalam mapping is
>>>> learned by
>>>>>>>>>>>> varnam.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have not completed drafting the malayalam stemmer
> algorithm.
>>>> It
>>>>>>>> seems
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> have many more condition checks than I had anticipated and
> could
>>>>>>>> end up
>>>>>>>>>>>>>>>> being larger and more complicated than the porter stemmer.
> But
>>>>>>>> before I
>>>>>>>>>>>>>>>> proceed, I need to know whether the logic I presented above
> is
>>>>>>>> correct.
>>>>>>>>>>>>
>>>>>>>>>>>> You are on the right track.
>>>>>>>>>>>>
>>>>>>>>>>>> Stemming in Indian languages is really complex because of the way we
>>>>>>>>>>>> write words. So don't worry about getting 100% stemming. IMO, that is
>>>>>>>>>>>> impossible to achieve. Instead, target stemming rules which will
>>>>>>>>>>>> probably give you a 60-70% success rate.
>>>>>>>>>>>>
>>>>>>>>>>>> We should make these stemming rules configurable in the scheme file.
>>>>>>>>>>>> So in the malayalam scheme file, you define
>>>>>>>>>>>>
>>>>>>>>>>>>         stem(a) = b
>>>>>>>>>>>>
>>>>>>>>>>>> This gets compiled into the `vst` file, and at runtime `libvarnam`
>>>>>>>>>>>> will read the stemming rule from the `vst` file and apply it to the
>>>>>>>>>>>> target word.
>>>>>>>>>>>>
>>>>>>>>>>>> As part of this, we also need to implement a sort of conjunct rule in
>>>>>>>>>>>> `libvarnam` so that it knows how to combine a base form and a vowel.
>>>>>>>>>>>> Don't worry about this now. We will deal with it later.
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Kevin Martin Jose
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Feb 28, 2014 at 7:50 PM, Navaneeth K N <
> address@hidden>
>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello Kevin,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2/28/14 12:43 PM, Kevin Martin wrote:
>>>>>>>>>>>>>>>>>>> I'm seeking to improve varnam's learning capabilities as a
>>>> GSoC
>>>>>>>>>>>> project.
>>>>>>>>>>>>>>>>>>> I've gone through the source code and I have doubts. I
> need
>>>> to
>>>>>>>>>>>> clarify if
>>>>>>>>>>>>>>>>>>> my line of thinking is right. Please have a look :
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1) Token : A token is an indivisible word. A token is the
>>>> basic
>>>>>>>>>>>> building
>>>>>>>>>>>>>>>>>>> block. 'tokens' is an object (instance? I mean the non-OOP
>>>>>>>>>>>> equivalent of
>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>> object) of the type varray. 'tokens' contain all the
> possible
>>>>>>>>>>>> patterns
>>>>>>>>>>>>>>>> of a
>>>>>>>>>>>>>>>>>>> token? For example, ?????? ????????????? ?????????? ?????
>>>> would
>>>>>>>> all
>>>>>>>>>>>> go
>>>>>>>>>>>>>>>>>>> under the same varray instance 'tokens'?. And each word (
>>>> for eg
>>>>>>>>>>>> ?????? )
>>>>>>>>>>>>>>>>>>> would occupy a slot at tokens->memory I suppose. Am I
> right
>>>> in
>>>>>>>> this
>>>>>>>>>>>>>>>> regard?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In ??????, ? will be a token. `varray` is a generic data structure
>>>>>>>>>>>>>>>> that can keep any elements and grow the storage as required. So
>>>>>>>>>>>>>>>> `tokens->memory` will have the following tokens: ?, ?, ??, ??. Each
>>>>>>>>>>>>>>>> token knows about a pattern and a value.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Look at the scheme files in the "schemes/" directory. A token is a
>>>>>>>>>>>>>>>> pattern-value mapping.
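>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Conceptually it is just a growable array, something like the toy
>>>>>>>>>>>>>>>> below (field names approximate, error handling omitted):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     #include <stdlib.h>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     typedef struct {
>>>>>>>>>>>>>>>>         void **memory;     /* element slots, e.g. vtoken pointers */
>>>>>>>>>>>>>>>>         size_t used;       /* elements currently stored           */
>>>>>>>>>>>>>>>>         size_t allocated;  /* slots allocated so far              */
>>>>>>>>>>>>>>>>     } toy_varray;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     static void toy_varray_push(toy_varray *a, void *element)
>>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>>         if (a->used == a->allocated) {
>>>>>>>>>>>>>>>>             a->allocated = a->allocated ? a->allocated * 2 : 8;
>>>>>>>>>>>>>>>>             a->memory = realloc(a->memory, a->allocated * sizeof *a->memory);
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>         a->memory[a->used++] = element;
>>>>>>>>>>>>>>>>     }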
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2) I see the data type 'v_' frequently used. However,I
> could
>>>> not
>>>>>>>>>>>> find its
>>>>>>>>>>>>>>>>>>> definition! I missed it, of course. Running ctrl+f on a
> few
>>>>>>>> source
>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>> did not turn up the definitions. So I thought I would
> simply
>>>> ask
>>>>>>>>>>>> here! I
>>>>>>>>>>>>>>>>>>> would be really grateful if you can tell me where it is
>>>> defined
>>>>>>>> and
>>>>>>>>>>>> why
>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> is defined (what it does)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That's a dirty hack. It's a #define, done at [1]. The preprocessor
>>>>>>>>>>>>>>>> replaces it with `handle->internal`; it is just a shorthand for
>>>>>>>>>>>>>>>> that. Not elegant, but we got used to it. We will clean it up one
>>>>>>>>>>>>>>>> day. Sorry for the confusion.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]: https://gitorious.org/varnamproject/libvarnam/source/68a17b6e2e5d114d6a606a9a47294917655a167f:util.h#L26
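>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In other words, roughly:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     /* expanded by the preprocessor inside functions that take a
>>>>>>>>>>>>>>>>      * `varnam *handle` argument (see util.h linked above) */
>>>>>>>>>>>>>>>>     #define v_ (handle->internal)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     /* so v_->strings_pool means handle->internal->strings_pool */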
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 3) I read the porter stemmer algorithm. The ideas page say
>>>>>>>>>>>> *"something
>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>>>>>> a porter stemmer implementation but integrated into the
>>>> varnam
>>>>>>>>>>>> framework
>>>>>>>>>>>>>>>> so
>>>>>>>>>>>>>>>>>>> that new language support can be added easily"*. I really
>>>> doubt
>>>>>>>> if
>>>>>>>>>>>>>>>>>>> implementing a porter stemmer would make adding new
> language
>>>>>>>> support
>>>>>>>>>>>> any
>>>>>>>>>>>>>>>>>>> easier. The English stemmer is an improvised version of
> the
>>>>>>>> original
>>>>>>>>>>>>>>>> porter
>>>>>>>>>>>>>>>>>>> stemmer. A stemming algorithm is specific to a particular
>>>>>>>> language
>>>>>>>>>>>> since
>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>>>>> deals with the suffixes that occur in that language. We
> need
>>>> a
>>>>>>>>>>>> malayalam
>>>>>>>>>>>>>>>>>>> stemmer, and if we want to add support to say telugu one
>>>> day, we
>>>>>>>>>>>> would
>>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>>>>>> a telugu stemmer. We can of course write one stemmer and
> add
>>>> test
>>>>>>>>>>>> cases
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> suffix condition checks in the new language so that
>>>> tokenization
>>>>>>>> can
>>>>>>>>>>>> be
>>>>>>>>>>>>>>>>>>> done with the same function call.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I said integrated into the framework, I meant making the
>>>>>>>>>>>>>>>> stemmer configurable at the scheme file level. Basically the scheme
>>>>>>>>>>>>>>>> file will have a way to define the stemming. Now when a new language
>>>>>>>>>>>>>>>> is added, there will be a new scheme file, and the stemming rules for
>>>>>>>>>>>>>>>> that language go into the appropriate scheme file. All varnam needs
>>>>>>>>>>>>>>>> to know is how to properly evaluate those rules.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am in the process of writing some documentation explaining the
>>>>>>>>>>>>>>>> scheme file and vst files. I will send it to you once it is done.
>>>>>>>>>>>>>>>> It will make this much easier to understand.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 4) The ideas page say "Today, when a word is learned,
> varnam
>>>>>>>> takes
>>>>>>>>>>>> all
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> possible prefixes into account". Prefixes? Shouldn't it be
>>>>>>>> suffixes?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No it is prefixes. For example, when the word ?????? is
> learned,
>>>>>>>> varnam
>>>>>>>>>>>>>>>> learns the prefixes, ??, ???? etc. So when it gets a pattern
>>>> like
>>>>>>>>>>>>>>>> "malayali", it can easily tokenize it rather than typing like
>>>>>>>>>>>> "malayaali".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Suffixes won't help because tokenization is left to right.
> This
>>>> is
>>>>>>>> where
>>>>>>>>>>>>>>>> another major improvement could be possible in varnam. If we
> can
>>>>>>>> come up
>>>>>>>>>>>>>>>> with tokeniation algorithm, which takes, prefixes, suffixes
> and
>>>>>>>> partial
>>>>>>>>>>>>>>>> matches into account, then we literally can transliterate any
>>>> word.
>>>>>>>> But
>>>>>>>>>>>>>>>> its a hard problem which needs lots of research and effort.
> The
>>>>>>>> effort
>>>>>>>>>>>>>>>> will be doing it at a scale at which varnam is operating now.
>>>> Today,
>>>>>>>>>>>>>>>> every key stroke that you make on the varnam editor, is
>>>> searching
>>>>>>>> over 7
>>>>>>>>>>>>>>>> million patterns to predict the result. All this happens in
> less
>>>>>>>> than a
>>>>>>>>>>>>>>>> second. Improving tokenization and keeping the current
>>>> performance
>>>>>>>> is a
>>>>>>>>>>>>>>>> *hard* problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Let me try and coin a malayalam stemmer. I will post what
> I
>>>> come
>>>>>>>> up
>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> here.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That's great. Feel free to ask any questions. You are already
>>>> asking
>>>>>>>>>>>>>>>> pretty good question. Good going.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Kevin Martin Jose
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
> 
>>
>>
> 

-- 
Cheers,
Navaneeth


