A soundslike problem with combined English+Russian dictionary

From: Maxim Nikulin
Subject: A soundslike problem with combined English+Russian dictionary
Date: Tue, 22 Jun 2021 23:56:25 +0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1


I am aware that Aspell does not support multi-lingual dictionaries, but I think that in some particular cases it is still possible to combine a couple of dictionaries and get a result of reasonable quality. I have almost achieved what I expected for merged English and Russian word lists, and I am quite satisfied even with the current result. Maybe I just have not yet discovered the detrimental effects of the missing affix table for English or of the combined special characters ("-" and "'").

I was hooked by the description of the metaphone algorithm, which should improve suggested corrections for misspelled words. Since I am not a native English speaker, I would not mind having such a feature if it helps me recall a word. For Russian, plain edit distance should be enough, so I tried to use a copy of en_phonet.dat with one added line (and an exact copy as well)

    remove_accents 0

that is referenced in the .dat file

    soundslike rue_phonet

To my surprise, with such a configuration the whole English alphabet is suggested as a replacement for a misspelled Russian word. In the following example the word "funetik" is taken from the manual to check that phonetic rules are taken into account (another example, taff -> tough, does not work with the default suggestion mode)

echo "funetik програма" | aspell -d ./rue.rws -a
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)
& funetik 26 0: fanatic, funk, fungi, Fuentes, functor, frenetic, genetic, 
kinetic, finite, fount, fungoid, funky, lunatic, phonetic, fountain, funked, Fundy, 
fined, founts, funded, font, fund, frantic, funkier, fount's, Fuentes's
& програма 100 8: программа, программ, A, B, C, D, E, F, G, H, I, J, K, L, M, 
N, O, P, Q, R, S, T, U, V, X, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, 
r, s, t, u, v, x, z, AA, AI, AR, Ar, Au, BA, BB, BO, Ba, Be, Bi, CA, CO, Ca, Ce, 
Ch, Ci, Co, Cu, DA, DD, DE, DI, Di, Du, Dy, ER, EU, Er, Eu, FY, Fe, GA, GE, GI, GU, 
Ga, Ge, HI, Ha, He, Ho, IA, IE, Ia, Io, Ir, Jo, KO, KY

That is why my current configuration is

     soundslike generic

which gives more reasonable variants for the Russian test word:

& програма 13 8: программа, программ, программе, программу, программы, 
программах, программам, программка, параграмма, программою, проиграна, параграмм, 

Have I done something wrong? Is it expected behavior that English phonetic rules have such a detrimental effect on variants for Russian words? I am unsure whether the observed result is a bug. (Actually the question is: `How many bugs have I faced?' with zero as a possible answer.)
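
The difference between the two outputs above can be put into numbers by counting how many suggestions are one or two characters long. The following is only a minimal sketch in Python, not part of Aspell. It assumes the usual ispell-compatible pipe-mode line format, "& <word> <count> <offset>: <suggestion>, ...", and that a reply wrapped by a mail client has been joined back into a single line. The names parse_miss_line and short_suggestions are made up for illustration.

```python
import re

# A "miss" line in ispell-compatible pipe mode (aspell -a) looks like:
#   & <original word> <count> <offset>: <suggestion>, <suggestion>, ...
MISS_LINE = re.compile(r"^& (\S+) (\d+) (\d+): (.*)$")


def parse_miss_line(line):
    """Return (word, count, offset, suggestions), or None for other lines."""
    match = MISS_LINE.match(line.strip())
    if match is None:
        return None
    word, count, offset, rest = match.groups()
    # Suggestions are comma-separated; drop empty items left by a trailing comma.
    suggestions = [item.strip() for item in rest.split(",") if item.strip()]
    return word, int(count), int(offset), suggestions


def short_suggestions(suggestions, max_len=2):
    """Select suggestions of at most max_len characters (hypothetical metric)."""
    return [item for item in suggestions if len(item) <= max_len]
```

Feeding both replies for the same test word through parse_miss_line and comparing len(short_suggestions(...)) gives a rough measure of how much noise a given soundslike setting adds.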

More details of my configuration.

The goal is to see misspelled words in my mixed-language notes. Suggested corrections are appreciated as well. This has worked in Vim for years:

    set spelllang=en,ru spell

and I would like to have a comparable feature in Emacs

    M-x flyspell-mode RET M-x ispell-change-dictionary RET rue

without special configuration of a custom dictionary in Emacs. Side note: I am certainly against the idea, which I have seen once, of binding the ispell dictionary to the input method.

There is a feature request for support of multi-lingual dictionaries
(and a number of similar threads in the archive of this mailing list).
People are still trying to combine dictionaries:
There is no section in the manual that clarifies possible problems of this approach.

I hope that in my particular case of English and Russian it can be done somewhat more accurately.

- I rarely use letters with accents, so the alphabets are disjoint sets of characters. US-ASCII is a subset of the KOI8-R encoding.
- The cost of discarding affix data for Russian is ~30M of disk space (and almost certainly of RAM as well). I am unsure whether I lose something by ignoring the affix table for English.
- The combined "special" is a kind of compromise; it should be per-language. However, I do not have an example of imperfect behavior yet.
- As I said above, I would prefer phonetic rules for English, but I have to use the generic ones.
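
The encoding claims in the first item can be checked with Python's standard codecs. This is a minimal sketch under the assumption that Python's "koi8_r" codec matches the KOI8-R charset used by the Russian dictionary; the function names are invented for illustration and have nothing to do with Aspell itself.

```python
import string
import unicodedata


def ascii_is_subset(encoding="koi8_r"):
    """True if bytes 0x00..0x7F decode to the same text as in US-ASCII."""
    raw = bytes(range(128))
    return raw.decode(encoding) == raw.decode("ascii")


def non_ascii_letters(encoding="koi8_r"):
    """Letters that the encoding stores in the upper half, 0x80..0xFF."""
    decoded = bytes(range(0x80, 0x100)).decode(encoding, errors="ignore")
    return {ch for ch in decoded if ch.isalpha()}


def alphabets_are_disjoint(encoding="koi8_r"):
    """True if the upper-half letters are all Cyrillic and none is an ASCII letter."""
    latin = set(string.ascii_letters)
    others = non_ascii_letters(encoding)
    all_cyrillic = all(
        unicodedata.name(ch).startswith("CYRILLIC") for ch in others
    )
    return latin.isdisjoint(others) and all_cyrillic


def encodable(word, encoding="koi8_r"):
    """True if every character of the word can be represented in the encoding."""
    try:
        word.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

A word list can be filtered with encodable() before it is piped into "aspell create master", so that accented Latin words which KOI8-R cannot represent are noticed early.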

--->8--- rue.dat begin --->8---

# Combined dictionary for English and Russian languages
# An attempt to create a dictionary suitable for spell checking
# of mixed-language texts.
# Something distinct from just "ru" and "en". Do not use a name longer
# than 3 characters, otherwise it will not appear in "aspell dump dicts"
# and thus will be ignored by other applications. Numbers, e.g. "ru2",
# make the language identifier invalid as well.
name            rue
# ISO8859-1 used for "en" dictionaries is a subset of KOI8-R
# modulo accents.
# The Russian dictionary from the system package on Ubuntu uses exactly KOI8-R.
charset         koi8-r
# Combine values from "ru" and "en"
special         - -*- ' -*-
# With
#     soundslike rue
# and a copy of en_phonet.dat aspell suggests
# e.g. "phonetic" for "funetik" input.
# Unfortunately it ruins scoring of corrections for Russian.
# Even with "remove_accents 0" inside "rue_phonet.dat", abundant
# single- and two-letter variants appear as alternatives.
# However, the couple of top-rated suggestions are still reasonable.
# A segfault may happen on an attempt to generate the master dictionary
# when "rue_phonet.dat" is missing from the current directory.
# As a compromise, prefer better quality of correction variants
# for Russian.
soundslike      generic
# Affix compression is not enabled for "en" system dictionaries.
# At the same time it saves a lot of space for the "ru" dictionary.
# The compressed dictionary is 3Mb, while the expanded one consumes 30Mb
# of disk space.
#     aspell --lang=ru --encoding=koi8-r dump master \
#         | aspell --lang=ru --encoding=koi8-r expand \
#         | aspell --lang=ru create master ./ru.rws
#     aspell --lang=ru --encoding=koi8-r dump master \
#         | aspell --lang=ru --encoding=koi8-r expand \
#         | tr ' ' '\n' \
#         | aspell --lang=ru create master ./ru-expand.rws
affix-compress  true
# Actually this value is ignored and "rue_affix.dat" is used instead
# (a copy or symlink is required).
affix           ru

# Noticed differences:
#     echo "programm funetic" | aspell --lang en -a
#     & programm 5 0: program, programs, programmer, programmed, program's
#     & funetic 14 9: fanatic, frenetic, genetic, kinetic, lunatic, phonetic, frantic, fungi, Fuentes, antic, functor, fanatics, fungoid, fanatic's
#     # ------------------------------------------------------------^^^^^^^^
#     echo "programm funetic" | aspell --lang rue -a
#     & programm 6 0: program, programs, programmed, programmer, program's, pogrom
#     #---------------------------------------------------------------------^^^^^^
#     & funetic 5 9: fanatic, genetic, kinetic, lunatic, Fuentes
# The absence of "phonetic" is caused by "soundslike generic". "Pogrom" is
# present in the original "en" word list.

---8<--- rue.dat end   ---8<---

--->8--- rue.multi begin --->8---

# Combined dictionary for English and Russian languages
# It is not possible to just add ru.multi and en.multi
# because the languages
# inside the dictionaries differ. Unsure if it is safe to generate
# a dictionary for the English language using a modified ru.dat
# with "special ' -*-".
# Let's generate dictionaries with "rue" as a language identifier.
# System-wide .rws files are created on Ubuntu in postinst scripts by
# /usr/sbin/update-dictcommon-aspell and /usr/sbin/aspell-autobuildhash
# utilities. Source word lists are provided
# in /usr/share/aspell directory.
# Example of command to unpack:
#     zcat /usr/share/aspell/en-wo_accents-only.cwl.gz | precat
# E.g. the en_US dictionary is a combination of en-common
# (shared with e.g. en_GB)
# and en-wo_accents-only. Unsure if I need this degree of word list
# granularity, so let's try a naive approach to create word lists.
# "rue_affix.dat" is required despite "affix ru" line in rue.dat
#     ln -s /usr/lib/aspell/ru_affix.dat rue_affix.dat
#     aspell --lang=ru --encoding=koi8-r dump master \
#          | aspell --lang=rue create master ./rue-ru.rws
# Despite warnings like
#     # Warning: Removing inapplicable affix 'H' from word Адель.
# expanded word list is the same as the original one.
add rue-ru.rws
# Specify the encoding to avoid UTF-8 in case some accents
# appear accidentally.
#     aspell --lang=en_US --encoding=iso8859-1 dump master \
#          | aspell --lang=rue create master ./rue-en_US.rws
add rue-en_US.rws

---8<--- rue.multi end   ---8<---

The commands to generate the word lists are in the last comments in rue.multi. Finally, I can run

    aspell --lang rue -a

Does such a configuration have apparent problems? Is it possible to use en_phonet.dat instead of "generic" for soundslike?
