[Tetum-translators] Fw: Tetum wordlist "official"


From: Peter Gossner
Subject: [Tetum-translators] Fw: Tetum wordlist "official"
Date: Tue, 25 May 2004 14:57:17 +0930


Forwarded message:

Date: Mon, 24 May 2004 22:52:59 -0500
From: Kevin Patrick Scannell <address@hidden>
To: address@hidden
Subject: Re: Tetum wordlist "official"




Dear Pete,

I am now attaching several files for you to look over.
These files contain word lists with frequency counts
given for each word -- these counts are often helpful in
spotting errors.
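
A minimal sketch of how per-word frequency counts like these could be
produced and reviewed, assuming a plain-text corpus. The tokenizer and the
function names (word_counts, rarest_first) are hypothetical illustrations,
not the code actually used to build the attached lists. Reviewing the
rarest forms first is only one possible heuristic, on the assumption that
typos tend to occur less often than the correct spelling.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally joined by
# apostrophes or hyphens. The real filters may tokenize differently.
TOKEN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def word_counts(text):
    """Return a Counter mapping each token in `text` to its frequency."""
    return Counter(TOKEN.findall(text))


def rarest_first(counts):
    """List (word, count) pairs, lowest count first, ties in word order."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    # Illustrative sample tokens only, not taken from the corpus.
    sample = "ita ita ita boot boot bot"
    for word, count in rarest_first(word_counts(sample)):
        print(f"{count}\t{word}")
```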

A.toadd.txt:    main list of candidate words for addition to the spell checker;
                these words pass through all of the statistical filters.

A.toaddcap.txt: same as above, but with words appearing primarily in
                upper case in the corpus (and so probably all proper
                names of one kind or another).      

A.accent.txt: pairs of words that pass through the filters but differ
              only in presence or absence of one or more diacritical marks.
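
The two special lists described above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the actual filter
code: the function names, the 0.5 threshold, and the reading of "upper
case" as "begins with a capital letter" are all hypothetical choices.

```python
import unicodedata
from collections import defaultdict


def strip_marks(word):
    """Remove diacritical marks by decomposing and dropping combining chars."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_groups(words):
    """Group words that differ only in the presence of diacritical marks."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_marks(word)].add(word)
    # Keep only keys with more than one distinct spelling.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


def primarily_capitalized(counts, threshold=0.5):
    """Return lowercased words whose capitalized share exceeds `threshold`.

    `counts` maps each surface form to its corpus frequency. The threshold
    is an assumed value chosen for illustration.
    """
    totals = defaultdict(int)
    capitalized = defaultdict(int)
    for form, n in counts.items():
        key = form.lower()
        totals[key] += n
        if form[:1].isupper():
            capitalized[key] += n
    return {key for key in totals if capitalized[key] / totals[key] > threshold}
```

For example, accent_groups applied to a list containing both an accented
and an unaccented spelling of the same word returns that pair under the
unaccented key, and primarily_capitalized applied to counts where "Dili"
far outnumbers "dili" returns the set containing "dili".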

   
   You can proceed however you like; just go through and either
delete or correct any words that shouldn't be part
of the spell checker.  When you're done, send me the revised lists
and I'll do two things: (1) package up the resulting clean word list
as an aspell package for you to test, and (2) rerun the statistical language
model using your changes and then generate deeper lists for
you to look over (if necessary).

-Kevin


On 00:13 Tue 25 May     , Peter Gossner wrote:
> Wow Hi Kevin !
> Mate that is really impressive... FanBloodyTastic
> 'Scuse the strange mail format this seemed easier at the time...
> 
> <quote who=KevinPS>
> Pete,
> 
>    Great news -- your word list was more than sufficient
> and the crawler bootstrapped without a hitch -- it's grabbed
> more than 400,000 words of Tetum already.   
>   Here are the docs after just 24 hours online:
> 
>  http://borel.slu.edu/tet.html
> 
> Perhaps, since you already have this clean list of
> 5000+ words, I should just run the corpus through
> my filters and generate some candidate words for
> you to look over -- by the looks of things, without
> too much effort on your part we could easily have
> a working aspell-tet package sometime this week!
> </quote>
> <reply who=pete>
> 
> Have everything ready to go from ASPELL CVS here... (I think)
> The "other Kevin" (Aspell Kevin)  is building a demo from the original
> 5000 list :) I think I get it. Some of the soundslike mappings are going
> to be ... interesting. (Tetum with an Au. accent ... lol, nah, I won't.)
> 
> I also now remember why I never bothered to learn C++ :)
> (how slow is the compiler !)
> 
> </reply>
> <quote who=KevinPS>
> Let me know and I'll fire off some lists to you.
> I guess since you aren't a native speaker this
> could be a bit harder than usual (i.e. maybe you'll
> need to consult a dictionary occasionally) but the filters
> and the frequency counts I'll give you will make things
> much easier.
> </quote>
> <reply who=pete>
> Sounds really good.
> Please fire away... what else do I need ?
> Um 400 thousand might be a good place to STOP :)
> LOL. 
> </reply>
> <quote who=KevinPS>
> -Kevin
> 
> PS if you see any non-Tetum docs on the page above,
> or repeats, please let me know since they'll skew the
> stats a bit...
> </quote>
> 
> <reply who=pete>
> 
> Looks very clean to me. I will have another look tomorrow.
> I thought some may be Indonesian or Portuguese but they seem pretty
> clean. 
> Mate, that's VERY impressive. I guess some may be duplicate
> content (the wiki entry seems to be everywhere, for instance, though not
> on your list, which is also impressive...). Is the code Open Source?
> 
> The original 5000 was fairly carefully gathered from "reputable /
> official sources". It looks like the few random pages I sampled were
> mostly pure "Dili-Tetum", with some Indonesian loanwords, but that is
> real life. That is how the real Timorese use the language. So
> for a GP dictionary... Excellent!
> 
> I wish my Tetum were that good, but to me it seems great.
> I will CC some real Tetum speakers and beta test the aspell dictionary
> (of course) with them and some written (paper) references, as well.
> 
> Kevin this is unbelievably great news ! I thought the 5000 was good :> !
> 
> Thank you, thank you.
> 
> Pete
> </reply>
> 
> 
> -- 
> Today's fortune:
> Today is the first day of the rest of your life.
>      
> < http://www.gnu.org/software/tetum/ >
> < http://bigbutton.com.au/~gossner >
> < address@hidden >
> 

-- 

Attachment: ""
Description: Text document

Attachment: ""
Description: Text document

Attachment: ""
Description: Text document

