Re: [lmi] Strip markup for spell checking?


From: Vadim Zeitlin
Subject: Re: [lmi] Strip markup for spell checking?
Date: Thu, 29 Oct 2015 18:57:09 +0100

On Thu, 29 Oct 2015 15:12:04 +0000 Greg Chicares <address@hidden> wrote:

GC> It seems clear that 'hunspell' wins,

 Yes, I've somehow missed this. At least I didn't recommend using ispell...

GC> hunspell's '-H' also seems to remove <!-- comments -->, so I'll avoid
GC> that. And hunspell has widely-reported problems with apostrophes,

 This is really surprising, especially as there doesn't appear to be any
workaround.

GC> I come to this casual but useful command:
GC> 
GC> < /opt/lmi/src/lmi/nasd.xsl sed -e'/<[^!].*>/d' \
GC>   | hunspell -L | tr --delete "'" | hunspell | sed -e'/^&/!d' \
GC>   -e'/^& \(MEC\|Sep\|nbsp\|DOCTYPE\|stylesheet\|xsl\|[Cc]hicares\) /d'
...
GC>   [I would have suggested "adjuvant".]
[I, for one, am looking forward to your custom hunspell dictionary release]
...
GC> This is already immediately useful. Refinements to the command I
GC> cobbled together are welcome.

 The main refinement I see is to use a custom dictionary instead of a long
(and bound to get longer) sed expression at the end. Custom hunspell
dictionaries are very simple to create: basically, if you have a wordlist,
you can just do

        $ sort -u wordlist | wc -l > custom.dic   # first line: entry count
        $ sort -u wordlist >> custom.dic          # then the words themselves

and this is how I created the attached xsl.dic.
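
For illustration, here is a minimal self-contained sketch of the same idea. The file names and the sample words are invented, and it assumes only the usual .dic convention that the first line holds the number of entries and the words follow one per line. Deduplicating before counting keeps that first line consistent with what comes after it.

```shell
#!/bin/sh
# Hypothetical working directory and word list, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' payor inforce xsl inforce > "$workdir/wordlist"

# Sort and deduplicate once, so the count matches the entries written below.
sort -u "$workdir/wordlist" > "$workdir/words.sorted"

# Some wc implementations pad the number with spaces; strip them.
count=$(wc -l < "$workdir/words.sorted" | tr -d ' ')

# Assemble the dictionary: entry count first, then the words.
{ printf '%s\n' "$count"; cat "$workdir/words.sorted"; } > "$workdir/custom.dic"

cat "$workdir/custom.dic"
```

The duplicate "inforce" is collapsed, so the dictionary ends up with a count line followed by three words.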

 The next refinement is to use the undocumented "-u2" option which gives
the context for the misspelling and is IMHO rather useful.

 The final one is to filter out the "misspellings" expressing a length in
inches or points.
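
As a rough sketch of such a filter: the sample report lines below are invented, and their shape (a line number followed by a sed-style `s/word/replacement/`) is an assumption inferred from the regular expressions in the combined command, not something checked against hunspell's documentation. The two stages are the same sed and grep expressions that end that command.

```shell
#!/bin/sh
# Invented report lines in the ASSUMED shape "<line>s/<word>/<suggestion>/".
filtered=$(
  printf '%s\n' \
    '12s/8.5in/8.5 in/' \
    '40s/inforce/enforce/' \
    '7s/-/ /' \
    '103s/10pt;/10 pt;/' \
  | sed 's/^[1-9][0-9]*//' \
  | grep -vE '^s/(-|([0-9.]+)(in|pt).?)/'
)
# sed drops the leading line number; grep -v then discards entries whose
# "word" is a lone hyphen or a number followed by "in" or "pt" (plus at
# most one trailing character, such as a semicolon).
printf '%s\n' "$filtered"
```

Of the four sample entries, only the one for "inforce" survives; the lengths and the lone hyphen are dropped.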

 Combining all this, I get the following command:

sed -e '/<[^!].*>/d' -e "s/'\([^']*\)'/\1/g" *.xsl \
  | hunspell -d en_US,../xsl -u2 \
  | sed 's/^[1-9][0-9]*//' | grep -vE '^s/(-|([0-9.]+)(in|pt).?)/'

(surprise: no Perl in sight). And the output is pretty good IMHO: there are
a few domain-specific words that you probably want to add to the
custom dictionary ("inforce", "payor"), some US vs UK discrepancies (?)
that I'm leaving for you to handle ("advisor", "persistency") and just one
remaining weirdness ("prepared....now") -- but the rest are either
correctly spelt words running together or what look like genuine spelling
errors to me.
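
To see what the two sed expressions at the front of this pipeline do before hunspell gets the text, here is a small sketch on an invented three-line input (the file and its contents are hypothetical, not taken from the lmi sources). The first expression deletes any line containing a tag that does not start with `<!`, so comments survive; the second removes paired apostrophes around a quoted run of text.

```shell
#!/bin/sh
# Hypothetical sample input standing in for a real .xsl file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<xsl:template match="/">
<!-- A commment with a typo -->
Premium paid by the 'payor' each year
EOF

# Same two expressions as in the combined command above.
cleaned=$(sed -e '/<[^!].*>/d' -e "s/'\([^']*\)'/\1/g" "$sample")
printf '%s\n' "$cleaned"
```

The element line disappears entirely, the comment line is kept for spell checking, and 'payor' loses its quotes. Note that a line mixing markup and prose is dropped as a whole, so text on such lines is never checked.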

 Please let me know if you see scope for any further improvements, but
AFAICS this is about as good as it gets if we limit ourselves to shell
one-liners; luckily they work well enough here. Now if you'd like me to
write a command running spell check on the C/C++ comments and strings, then
I'd probably need Perl to do it...

 Regards,
VZ

Attachment: xsl.dic
Description: Text document

