[aspell-devel] Re: Aspell and
From: Mike C. Fletcher
Subject: [aspell-devel] Re: Aspell and
Date: Mon, 21 Oct 2002 18:24:35 -0400
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1) Gecko/20020826
Sorry if I caused offense by using your code without notifying you. I
didn't really think you'd be interested in a project that's so early in
its life-cycle (I only just set up the SourceForge project in the last
hour). I cancelled a message I wrote to you Saturday night because I
figured you'd be too busy to be answering questions from the likes of me
before I have anything that's actually working :) .
As for compiling Aspell on Win32, I hadn't tried the MinGW version of
GCC. I had noticed the post about the VC++ compilation patch, but your
comment on it seemed to suggest that it would require quite a bit of
work to be acceptable. Given that I have no great C/C++ skill, it is
easier for me to build the infrastructure in Python and only use C/C++
for a few key algorithms than it is to try to modify a complex C/C++
project.
Too bad about using the *.rws files directly, but having considered it,
I'm leaning toward giving (GUI) tools to both dictionary creators and
users for generating redistributable files for both dictionaries and
word-sets. From the sound of it, it should be easy to let users generate
distributables for either system: if they have Aspell installed, we'll
offer the word-list-(de)compress functionality; otherwise I'll only
accept/generate uncompressed lists.
I am somewhat at a loss as to how you access the "compressed" files. I'd
thought they were using a b-tree or similar index, but it doesn't seem
that way when I look at the code for word-list-compress. Are you
loading the whole word-set into memory? That should make it fast, but
doesn't it consume a lot of space? I'm currently using bsddb tables on
disk, with an in-memory hash-table implementation for temporary
word-sets (such as per-document and per-application sets).
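That layered setup (temporary in-memory sets in front of persistent ones) could be sketched roughly like this; the class and method names here are my own illustrations, not anything from Aspell or the project:

```python
class LayeredWordSet:
    """Check words against a chain of word-sets, most specific first.

    Layers might be: per-document set, per-application set,
    user dictionary, system dictionary.
    """

    def __init__(self, *layers):
        # Each layer is any object supporting "word in layer",
        # e.g. a plain set() or a bsddb/dbm-style mapping.
        self.layers = list(layers)

    def __contains__(self, word):
        return any(word in layer for layer in self.layers)

    def add(self, word, layer=0):
        # New words go into the most specific (temporary)
        # layer by default.
        self.layers[layer].add(word)


# Example: an in-memory per-document set in front of a "system" set.
document_set = set()
system_set = {"aspell", "python", "dictionary"}
words = LayeredWordSet(document_set, system_set)
words.add("sourceforge")
```

The disk-backed layers would just swap a bsddb table in for the plain `set()` without changing the lookup code.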
I'll have to look at the typo-weighting code, as I'm not sure where to
hook it into the edit-distance algorithm. It would seem that you'd need
each "swap" to be a lookup into the typo table. I'm looking at making a
set of ranking algorithms based on:

  * set meta-data
      - user-specific sets have a higher rank than system sets
      - dictionaries declare each set's "commonality" ranking (e.g. the
        English dict has levels 10, 20, ... 90)
      - might allow for "formality" rankings (e.g. slang word-sets have
        a lower ranking in Business dictionaries and a higher one in
        Informal dictionaries); similarly "technicality", "political
        correctness", or whatever key you want. Made a float factor;
        sets which don't include the meta-data just get the default
        values. Each dictionary would then include the set meta-data to
        determine the ranking of suggestions within itself. Most likely
        a single float value would be used at run time (basically the
        product of the various set weightings).

  * frequency tracking
      - individual user's word-frequency tracking (optional). If it's
        tracked, may as well use it.
      - individual user's typo-frequency tracking (optional). It might
        be useful to track the frequency of typos for a given user to
        generate the weightings (i.e. if a correction is reported,
        increment the diff (i -> o) frequency record as well as the
        whole-word correction's frequency record).
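The meta-data idea above (a single run-time float that is the product of the set weightings) could be sketched like this; the key names and default values are illustrative assumptions, not anything Aspell defines:

```python
# Hypothetical meta-data keys; sets that omit a key get the default.
DEFAULTS = {"commonality": 1.0, "formality": 1.0, "technicality": 1.0}


def set_weight(meta):
    """Collapse a word-set's meta-data into one float factor,
    the product of the individual weightings."""
    weight = 1.0
    for key, default in DEFAULTS.items():
        weight *= meta.get(key, default)
    return weight


def rank_suggestion(edit_cost, meta, user_rank=1.0):
    # Lower is better: the edit-distance cost is discounted by the
    # combined set weighting, so user-specific / highly-weighted
    # sets surface their suggestions first.
    return edit_cost / (set_weight(meta) * user_rank)


# A slang set might be down-weighted in a Business dictionary:
slang_in_business = {"commonality": 0.9, "formality": 0.5}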
Anyway, rather than blathering on at you, I suppose I'll do some more
work now. Have fun,
Mike
Kevin Atkinson wrote:
[CC to Aspell-devel for a public record of our conversation, please
continue to do so unless you have a good reason not to.]
I was browsing through Usenet groups on the search term "Aspell", as I
do from time to time to see what people are saying about Aspell, and I
came across your thread "Spell-check engine?" on comp.lang.python.
Although the LGPL gives you the right to reuse my code, I would have
appreciated a note to that effect. You could have saved yourself a
decent deal of effort by contacting me first.
A few points I want to address:
The Aspell library should compile on Win32 using the MinGW version of
GCC, which means that the Cygwin library does not need to be pulled in.
It can now also be compiled with VC++ using a user-contributed patch,
but that is completely unsupported by me.
Do not even think about using the *.rws files, as that is a compiled
dictionary format internal to Aspell and can change at any time. For
example, the next Aspell release, 0.51, will change the format of the
compiled word lists in a non-trivial way. However, using the *.cwl
files is rather easy. All the *.cwl files are just word lists
compressed with the word-list-compress utility distributed with Aspell.
The process is extremely simple and can easily be implemented in any
language.
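I can't vouch for the exact byte layout word-list-compress emits, but the classic scheme for compressing a sorted word list is shared-prefix encoding: store, for each word, how many leading characters it shares with the previous word, plus its remaining suffix. A round-trip sketch of that idea:

```python
def compress(words):
    """Shared-prefix compress a sorted word list: each entry is
    (chars shared with the previous word, remaining suffix)."""
    out, prev = [], ""
    for word in words:
        shared = 0
        while (shared < min(len(prev), len(word))
               and prev[shared] == word[shared]):
            shared += 1
        out.append((shared, word[shared:]))
        prev = word
    return out


def decompress(entries):
    """Rebuild the word list from (shared, suffix) entries."""
    words, prev = [], ""
    for shared, suffix in entries:
        word = prev[:shared] + suffix
        words.append(word)
        prev = word
    return words


packed = compress(["apple", "apply", "ash", "banana"])
# -> [(0, 'apple'), (4, 'y'), (1, 'sh'), (0, 'banana')]
```

Because consecutive sorted words share long prefixes, the suffixes stay short and the list compresses well without any index structure.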
When edit distances are computed, each "edit" has a weight associated
with it. When typo analysis is used, the weights are significantly
different from those of the normal edit-distance algorithm. The basic
algorithm is the same, however.
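A weighted edit distance along those lines can be sketched as the standard Wagner-Fischer dynamic program with a per-substitution cost table; the weights and the typo table below are made-up illustrations, not Aspell's actual values:

```python
def weighted_edit_distance(a, b, sub_weights=None,
                           insert_cost=1.0, delete_cost=1.0,
                           default_sub_cost=1.0):
    """Wagner-Fischer edit distance with per-substitution weights.

    sub_weights maps (char_a, char_b) pairs to a cost, so a typo
    table can make likely slips (e.g. 'i' -> 'o') cheap.
    """
    sub_weights = sub_weights or {}
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * delete_cost
    for j in range(1, n + 1):
        d[0][j] = j * insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0  # matching characters cost nothing
            else:
                # This is where each "swap" looks into the typo table.
                sub = sub_weights.get((a[i - 1], b[j - 1]),
                                      default_sub_cost)
            d[i][j] = min(d[i - 1][j] + delete_cost,
                          d[i][j - 1] + insert_cost,
                          d[i - 1][j - 1] + sub)
    return d[m][n]


# Adjacent-key slips cost less than arbitrary substitutions:
typo_table = {("i", "o"): 0.3, ("o", "i"): 0.3}
```

With the table in place, "wird" -> "word" scores much lower than "wird" -> "ward", so keyboard-plausible corrections rank first.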
If you have any other questions I will be happy to address them.
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/