bug#17742: Acknowledgement (Support for enchant?)

On 19 December 2016 at 16:01, Eli Zaretskii <eliz@gnu.org> wrote:

> From: Reuben Thomas <rrt@sc3d.org>
> Date: Sun, 18 Dec 2016 23:39:54 +0000
> Cc: 17742@debbugs.gnu.org
>
> I have not had any response to my enquiries yet, but I did some research, and neither GNU Aspell nor hunspell offer any way to get this information (about character classes of dictionaries) via their APIs.

They provide this information in the dictionaries, and we glean it
from there. See ispell-parse-hunspell-affix-file and
ispell-aspell-find-dictionary.

The dictionaries are not part of the API (even where the format is documented, the location may not be fixed), so it's not a good idea to rely on them.

Having discovered that Aspell does not provide this information (I checked again, and ispell-aspell-find-dictionary does not find this information in the dictionaries, except for limited information about otherchars; for casechars and not-casechars it defaults to [:alpha:]), I shall investigate with the hunspell maintainers.

Maybe there's a misunderstanding: I'm talking about the CASECHARS,
NOT-CASECHARS, and OTHERCHARS parts of the dictionary data in
ispell-dictionary-alist.

There's no misunderstanding here, that's what I'm talking about.

Each dictionary can (and many do) use some of the punctuation
characters in the words it can handle. A notable example is the
apostrophe ' in English, used for the various suffixes that spellers
support; similar features exist in other languages, but with possibly
different punctuation characters. Ispell.el must match that by using
the speller's notion of a word, which must be independent of the
current major mode's idea of what a word is. This is where these
character sets come into play, and I really cannot see how
ispell.el can work well without using them as it does now.

Currently, using casechars = [[:graph:]], if I put point over part of the string " (XP) ", and run M-x ispell-word, it says "(XP) is correct". That's good enough for me!

Note that merely using the characters declared in the dictionary may not be enough: I have words like SC³D (I spell my company that way) in my personal word lists. Other users might be more imaginative, and for example have sequences of emoji. The list of characters in the dictionary is only a minimum.
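
To make the two positions concrete, here is a minimal sketch in Python (not ispell.el code) of finding the word around point from a casechars predicate and an otherchars set. The names `word_at`, `is_alpha` and `is_graph` are hypothetical. `is_graph` only approximates Emacs's [:graph:] class, and the rule that an otherchar counts only between two casechars is a simplification of what ispell.el actually does.

```python
def word_at(text, point, is_casechar, otherchars=""):
    """Return the word covering index `point`, or None if point is not on a word."""
    def in_word(i):
        ch = text[i]
        if is_casechar(ch):
            return True
        # Simplifying assumption: an otherchar (e.g. the apostrophe) is part
        # of a word only when it has a casechar on both sides.
        return (ch in otherchars
                and 0 < i < len(text) - 1
                and is_casechar(text[i - 1])
                and is_casechar(text[i + 1]))

    if not (0 <= point < len(text)) or not in_word(point):
        return None
    start = point
    while start > 0 and in_word(start - 1):
        start -= 1
    end = point + 1
    while end < len(text) and in_word(end):
        end += 1
    return text[start:end]


def is_alpha(ch):
    # Rough stand-in for casechars = [[:alpha:]]
    return ch.isalpha()


def is_graph(ch):
    # Rough stand-in for casechars = [[:graph:]]: printable and not whitespace
    return ch.isprintable() and not ch.isspace()


print(word_at(" (XP) ", 2, is_graph))       # word under the graph-based notion
print(word_at(" (XP) ", 2, is_alpha, "'"))  # word under the alpha-based notion
print(word_at("SC³D", 0, is_alpha, "'"))
print(word_at("SC³D", 0, is_graph))
```

With the graph-based predicate the parentheses and the superscript digit stay attached to the word; with the alpha-based predicate they act as word boundaries, which is the trade-off under discussion.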

So we do need this information. If Enchant doesn't provide it, we
could still use the same technique as with Aspell and Hunspell,
provided that we can figure out which back end(s) is/are used by
Enchant. Is that doable?

Yes, that can be done, but it's fragile; that's why I'm trying to avoid it.

Ispell.el also supports spell-checking by words, in which case the
above is not useful, because we need to figure out what is a word.

See above. It's not clear to me that we need a very precise idea of what constitutes a word.

Moreover, even when we send entire lines to the speller, we want to
skip lines that include only non-word characters.

Why?
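
For reference, the line-skipping behaviour described above amounts to something like the following Python sketch (an illustration only, not the ispell.el implementation). The function name `lines_worth_checking` is hypothetical, and `str.isalpha` stands in for a dictionary's casechars.

```python
def lines_worth_checking(lines, is_casechar):
    """Keep only the lines that contain at least one word character."""
    return [line for line in lines
            if any(is_casechar(ch) for ch in line)]


sample = ["-----", "hello world", "  ", "*** note ***"]
print(lines_worth_checking(sample, str.isalpha))
```

Lines made up entirely of punctuation or whitespace are dropped before anything is sent to the speller; every other line is passed through unchanged.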

Just look at the callers of the above-mentioned accessor functions,
and you will see how we use them.

I have read this code. I see how we use them; it's just not clear to me that it's necessary to use them thus.

Hunspell is the most modern and sophisticated speller, we certainly
don't want to degrade it.

No chance of that, this patch is only about Enchant.

Also, Aspell uses the dictionaries at least
for some of this info, see the function I pointed to above.

Only for otherchars, not casechars/not-casechars.

Bottom line, this information cannot be thrown away or ignored. It is
important for correctly interfacing with a dictionary and for doing
TRT as the users expect. Any modern speller program would benefit
from it, and therefore we should strive to provide such information to
ispell.el whenever we possibly can.

It is not a question of throwing away or ignoring information: the information is simply not available through documented channels (at least for Enchant). Yes, one can find the underlying engine and then use that information to (try to) find the dictionaries, but one is then making a number of brittle assumptions. And it's not clear that the information is actually necessary to have.

It would be helpful if you could show a situation in which using [:graph:] for Enchant dictionaries actually misbehaves in some way.

In fact, reading enchant's source code, it uses a fixed set of Unicode classes for its own internal equivalent of casechars. Using that would make sense (for Enchant! again, I'm not suggesting changing how we use hunspell).

One other data point: a senior LyX maintainer, Jean-Marc Lasgouttes, agrees with you:

https://github.com/AbiWord/enchant/issues/17#issuecomment-267924304

He says that LyX has a "bug open somewhere" that suggests using this information (but he didn't know it was available!).

http://rrt.sc3d.org

From:	Reuben Thomas
Subject:	bug#17742: Acknowledgement (Support for enchant?)
Date:	Mon, 19 Dec 2016 21:47:42 +0000