[aspell-devel] Re: Aspell and
From: Mike C. Fletcher
Subject: [aspell-devel] Re: Aspell and
Date: Mon, 21 Oct 2002 18:24:35 -0400
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1) Gecko/20020826
Sorry if I caused offense by using your code without notifying you. I
didn't really think you'd be interested in a project that's so early in
its life-cycle (I only just set up the SourceForge project in the last
hour). I cancelled a message I wrote to you Saturday night because I
figured you'd be too busy to be answering questions from the likes of me
before I have anything that's actually working :) .
As for compiling Aspell on Win32, I hadn't tried the MinGW version of
GCC. I had noticed the post about the VC++ compilation patch, but your
comment on it seemed to suggest that it would require quite a bit of
work to be acceptable. Given that I have no great C/C++ skill, it is
easier for me to build the infrastructure in Python and only use C/C++
for a few key algorithms than it is to try to modify a complex C/C++
project.
Too bad about using the *.rws files directly, but having considered it,
I'm leaning toward giving (GUI) tools to both dictionary creators and
users for generating redistributable files for both dictionaries and
word-sets. From the sound of it, it should be easy to let users generate
distributables for either system: if they have Aspell installed, we'll
offer the word-list-(de)compress functionality; otherwise I'll only
accept/generate uncompressed lists.
I am somewhat at a loss as to how you access the "compressed" files. I'd
thought they were using a b-tree or similar index, but it doesn't seem
that way when I look at the code for word-list-compress. Are you
loading the whole word-set into memory? That should make it fast, but
doesn't it consume a lot of space? I'm currently using bsddb tables on
disk, with an in-memory hash-table implementation for temporary
word-sets (such as per-document and per-application sets).
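That layered setup (temporary in-memory sets in front of persistent ones) could be sketched roughly like this; the class and method names here are my own illustrations, not anything from Aspell or the project:

```python
class LayeredWordSet:
    """Check words against a chain of word-sets, most specific first.

    Layers might be: per-document set, per-application set,
    user dictionary, system dictionary.
    """

    def __init__(self, *layers):
        # Each layer is any object supporting "word in layer",
        # e.g. a plain set() or a bsddb/dbm-style mapping.
        self.layers = list(layers)

    def __contains__(self, word):
        return any(word in layer for layer in self.layers)

    def add(self, word, layer=0):
        # New words go into the most specific (temporary)
        # layer by default.
        self.layers[layer].add(word)


# Example: an in-memory per-document set in front of a "system" set.
document_set = set()
system_set = {"aspell", "python", "dictionary"}
words = LayeredWordSet(document_set, system_set)
words.add("sourceforge")
```

The disk-backed layers would just swap a bsddb table in for the plain `set()` without changing the lookup code.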
I'll have to look at the typo-weighting code, as I'm not sure where to
hook it into the edit-distance algorithm. It would seem that you'd need
each "swap" to be a lookup into the typo table. I'm looking at making a
set of ranking algorithms based on:

  * set meta-data
      - user-specific sets have a higher rank than system sets
      - dictionaries declare each set's "commonality" ranking (e.g. the
        English dict has levels 10, 20, ... 90)
      - might allow for "formality" rankings (e.g. slang word-sets have
        a lower ranking in Business dictionaries and a higher one in
        Informal dictionaries); similarly "technicality", "political
        correctness", or whatever key you want. Made a float factor;
        sets which don't include the meta-data just get the default
        values. Each dictionary would then include the set meta-data to
        determine the ranking of suggestions within itself. Most likely
        a single float value would be used at run time (basically the
        product of the various set weightings).

  * frequency tracking
      - individual user's word-frequency tracking (optional). If it's
        tracked, may as well use it.
      - individual user's typo-frequency tracking (optional). It might
        be useful to track the frequency of typos for a given user to
        generate the weightings (i.e. if a correction is reported,
        increment the diff (i -> o) frequency record as well as the
        whole-word correction's frequency record).
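The meta-data idea above (a single run-time float that is the product of the set weightings) could be sketched like this; the key names and default values are illustrative assumptions, not anything Aspell defines:

```python
# Hypothetical meta-data keys; sets that omit a key get the default.
DEFAULTS = {"commonality": 1.0, "formality": 1.0, "technicality": 1.0}


def set_weight(meta):
    """Collapse a word-set's meta-data into one float factor,
    the product of the individual weightings."""
    weight = 1.0
    for key, default in DEFAULTS.items():
        weight *= meta.get(key, default)
    return weight


def rank_suggestion(edit_cost, meta, user_rank=1.0):
    # Lower is better: the edit-distance cost is discounted by the
    # combined set weighting, so user-specific / highly-weighted
    # sets surface their suggestions first.
    return edit_cost / (set_weight(meta) * user_rank)


# A slang set might be down-weighted in a Business dictionary:
slang_in_business = {"commonality": 0.9, "formality": 0.5}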
Anyway, rather than blathering on at you, I suppose I'll do some more
work now. Have fun,
Mike
Kevin Atkinson wrote:
[CC to Aspell-devel for a public record of our conversation, please
continue to do so unless you have a good reason not to.]
I was browsing through Usenet groups on the search term "Aspell", as I
do from time to time to see what people are saying about Aspell, and I
came across your thread "Spell-check engine?" on comp.lang.python.
Although the LGPL gives you the right to reuse my code, I would have
appreciated a note to that effect. You could have saved yourself a
decent deal of effort by contacting me first.
A few points I want to address:
The Aspell library should compile on Win32 using the MinGW version of
GCC, which means that the Cygwin library does not need to be pulled in.
It can now also be compiled with VC++ using a user-contributed patch,
but that is completely unsupported by me.
Do not even think about using the *.rws files, as that is a compiled
dictionary format internal to Aspell and can change at any time. For
example, the next Aspell release, 0.51, will change the format of the
compiled word lists in a non-trivial way. However, using the *.cwl
files is rather easy. All the *.cwl files are just word lists
compressed with the word-list-compress utility distributed with Aspell.
The process is extremely simple and can easily be implemented in any
language.
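I can't vouch for the exact byte layout word-list-compress emits, but the classic scheme for compressing a sorted word list is shared-prefix encoding: store, for each word, how many leading characters it shares with the previous word, plus its remaining suffix. A round-trip sketch of that idea:

```python
def compress(words):
    """Shared-prefix compress a sorted word list: each entry is
    (chars shared with the previous word, remaining suffix)."""
    out, prev = [], ""
    for word in words:
        shared = 0
        while (shared < min(len(prev), len(word))
               and prev[shared] == word[shared]):
            shared += 1
        out.append((shared, word[shared:]))
        prev = word
    return out


def decompress(entries):
    """Rebuild the word list from (shared, suffix) entries."""
    words, prev = [], ""
    for shared, suffix in entries:
        word = prev[:shared] + suffix
        words.append(word)
        prev = word
    return words


packed = compress(["apple", "apply", "ash", "banana"])
# -> [(0, 'apple'), (4, 'y'), (1, 'sh'), (0, 'banana')]
```

Because consecutive sorted words share long prefixes, the suffixes stay short and the list compresses well without any index structure.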
When edit distances are computed, each "edit" has a weight associated
with it. When typo analysis is used, the weights are significantly
different from those of the normal edit-distance algorithm. The basic
algorithm is the same, however.
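A weighted edit distance along those lines can be sketched as the standard Wagner-Fischer dynamic program with a per-substitution cost table; the weights and the typo table below are made-up illustrations, not Aspell's actual values:

```python
def weighted_edit_distance(a, b, sub_weights=None,
                           insert_cost=1.0, delete_cost=1.0,
                           default_sub_cost=1.0):
    """Wagner-Fischer edit distance with per-substitution weights.

    sub_weights maps (char_a, char_b) pairs to a cost, so a typo
    table can make likely slips (e.g. 'i' -> 'o') cheap.
    """
    sub_weights = sub_weights or {}
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * delete_cost
    for j in range(1, n + 1):
        d[0][j] = j * insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0  # matching characters cost nothing
            else:
                # This is where each "swap" looks into the typo table.
                sub = sub_weights.get((a[i - 1], b[j - 1]),
                                      default_sub_cost)
            d[i][j] = min(d[i - 1][j] + delete_cost,
                          d[i][j - 1] + insert_cost,
                          d[i - 1][j - 1] + sub)
    return d[m][n]


# Adjacent-key slips cost less than arbitrary substitutions:
typo_table = {("i", "o"): 0.3, ("o", "i"): 0.3}
```

With the table in place, "wird" -> "word" scores much lower than "wird" -> "ward", so keyboard-plausible corrections rank first.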
If you have any other questions I will be happy to address them.
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/