[Tetum-translators] Fw: Grammar checking


From: Peter Gossner
Subject: [Tetum-translators] Fw: Grammar checking
Date: Tue, 31 Aug 2004 15:07:59 +0930


Forwarded message:
Congrats to Kevin and great news for all of us "minor" locale users.

Date: Fri, 27 Aug 2004 11:52:57 -0500
From: Kevin Patrick Scannell <address@hidden>
To: address@hidden
Subject: Grammar checking




Dear Pete,

Here is a general announcement about my grammar checker
that I'm sending out to all An Crúbadán contributors.
Something to keep in the back of your mind if you
are able to find some native language speakers as
volunteers...
Best
Kevin

***********************************************************************

   I'm writing first of all to thank you all for your help with my
corpus-building project An Crúbadán.  I'm pleased to say that the
crawler is now running for more than 160 languages and has gathered
minority language text corpora totaling over *213 million* words.
See http://borel.slu.edu/crubadan/stadas.html

   A special thanks goes to those of you who have undertaken the
tremendous task of editing the word lists output by the statistical
filters.  This work has resulted in the development of *eleven* new open
source spellchecking packages; for most of these languages there was
little in the way of language technology before we began: Azerbaijani,
Mongolian, Chichewa, Kinyarwanda, Tetum, Tagalog, Setswana, Malagasy,
Irish, Scottish, and Manx Gaelic.  Word lists for another 12 languages
are in the hands of editors as I write this.

   My main purpose in writing is to announce the release of
version 0.5 of my grammar checking software "An Gramadóir" in the hope
that some of you might be interested in attempting an implementation
for your native language.   The core engine is written in Perl
and therefore runs under Windows, Mac, Linux, etc.  It has full
support for Unicode and has interfaces for use with text editors like
emacs, vim, and OpenOffice.  You can read more about its features here:
http://borel.slu.edu/gramadoir/index.html

If you're interested, please reply and let me know.  If there is enough
interest I may set up a mailing list at the project's SourceForge site:
http://sourceforge.net/projects/gramadoir/

In any case, all I'll ask from you at this point is that you think
about the kinds of errors that a standard "word at a time" spell checker
for your language misses.   If you send me detailed descriptions of a
few of these errors (ideally with example sentences containing errors
and others with correct usage) I'll attempt to implement them, build a
language pack using the web crawler data, and send it to you for
testing.  Here are some of the kinds of errors that are more or less
trivial for An Gramadóir to catch but are missed by programs like
aspell/ispell/myspell (a Perl sketch of such checks follows the list):

1. Context-sensitive spelling rules, like requiring "an" before a vowel
(most of the time) and "a" before a consonant in English: *"a apple",
*"an dog".  Such rules are
very common in Irish and the current version of An Gramadóir implements
most of them.

2. Doubled words.  *"I went to to the store."

3. Unusual words "hiding" common misspellings.
  *"I cant go to the store", *"I like yor shirt"

4. Words appearing only in set phrases.  *"on lieu of"
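
To make these concrete, here is a minimal sketch in Perl (the language
the engine itself is written in) of how such surface-level checks might
look.  The regexes, word lists, and phrases below are invented for the
example and are not An Gramadóir's actual rule format:

    #!/usr/bin/perl
    # Illustrative detectors for the four error classes above.
    use strict;
    use warnings;

    my $text = "I went to to the store and bought a apple with yor money.";

    # 1. Context-sensitive spelling: "a" directly before a vowel.
    print "suspect article: '$&'\n" while $text =~ /\ba\s+[aeiou]\w*/gi;

    # 2. Doubled words.
    print "doubled word: '$1'\n" while $text =~ /\b(\w+)\s+\1\b/gi;

    # 3. Unusual words hiding common misspellings (hypothetical list).
    my %dangerous = (yor => "your", cant => "can't");
    for my $w (split /\W+/, lc $text) {
        print "did you mean '$dangerous{$w}'? (saw '$w')\n"
            if exists $dangerous{$w};
    }

    # 4. Words valid only in a set phrase: "lieu" outside "in lieu of".
    print "set-phrase violation near 'lieu'\n"
        if $text =~ /\blieu\b/i && $text !~ /\bin\s+lieu\s+of\b/i;

Each class reduces to cheap pattern matching over the token stream,
which is why they are more or less trivial to catch once someone who
knows the language writes the patterns down.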

More interesting are rules requiring some understanding of parts of
speech or gender, e.g. in French *"à le carte", *"la monde", or of
number agreement between noun and verb, noun and adjective, etc.  Such
rules might involve assigning part-of-speech tags to your word list.
While this will involve a certain amount of unpleasant labor, it will
surely pay off in the long run, as robust part-of-speech tagging is an
absolute prerequisite for more advanced NLP tasks such as parsing or
machine translation.
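
As a hedged sketch of what a tag-based rule could look like once the
word list carries such tags, here is a toy gender-agreement check in
Perl.  The two tiny lexica and their tag values are invented for the
example; a real tagged word list would cover thousands of forms:

    #!/usr/bin/perl
    # Toy article/noun gender-agreement check (illustrative only).
    use strict;
    use warnings;

    my %noun_gender = (carte => "f", monde => "m");   # hypothetical tags
    my %art_gender  = (le => "m", la => "f");

    sub check_gender {
        my @w = split ' ', lc shift;
        for my $i (0 .. $#w - 1) {
            my ($art, $noun) = @w[$i, $i + 1];
            next unless exists $art_gender{$art} && exists $noun_gender{$noun};
            print "gender mismatch: '$art $noun'\n"
                if $art_gender{$art} ne $noun_gender{$noun};
        }
    }

    check_gender("la monde");   # flagged: "monde" is masculine
    check_gender("la carte");   # silent: agreement is correct

(The *"à le" example is a rule of the same shape but simpler still: the
bigram is forbidden outright, since French contracts it to "au".)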

   The point I'd like to emphasize most strongly is that this is
a tool designed in various ways for use by language communities who
are facing the development of language technology with severely limited
resources.   This means first of all a *commitment to open source* as
the only viable way of harnessing the collective effort needed to
develop such complex and data-intensive software from scratch.  In the
specific context of natural language processing, it also means a
corpus-based/statistical approach, e.g. using corpus data to develop a
lexicon as many of you have done with An Crúbadán, or, at a deeper
level, using An Gramadóir to learn part-of-speech tagging rules from
untagged text (using an algorithm of Eric Brill).  I also have scripts
for finding "dangerous" low frequency words (like the "yor", "cant"
examples above), examples of legal doubled words (not uncommon in many
languages), and much more.
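
As a guess at how the "dangerous" word hunt might work (this is not
Kevin's actual script): count word frequencies over the corpus, then
flag rare forms that sit one character-deletion away from a very common
word, the way "cant" sits next to "can't".  The frequency cutoffs and
the deletion-only edit model are assumptions:

    #!/usr/bin/perl
    # Sketch: flag rare words one deletion away from a common word.
    use strict;
    use warnings;

    my %freq;
    $freq{lc $_}++ for grep { length } map { split /[^\w']+/ } <>;

    my @common = grep { $freq{$_} >= 100 } keys %freq;   # assumed cutoffs
    my %rare   = map  { $_ => 1 } grep { $freq{$_} <= 3 } keys %freq;

    for my $c (@common) {
        for my $i (0 .. length($c) - 1) {
            my $d = $c;
            substr($d, $i, 1) = '';            # drop one character
            print "dangerous: '$d' (rare) <- '$c' (common)\n" if $rare{$d};
        }
    }

Run over a corpus file (or stdin), anything it prints is a candidate
for the same kind of hand review the word lists already get.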

   Thanks for taking the time to read all of this and I hope
to hear from many of you soon.

-Kevin