bug#17742: Acknowledgement (Support for enchant?)

On 20 December 2016 at 15:40, Eli Zaretskii <eliz@gnu.org> wrote:

> From: Reuben Thomas <rrt@sc3d.org>
> Date: Mon, 19 Dec 2016 21:47:42 +0000
> Cc: 17742@debbugs.gnu.org
>
> Neither GNU Aspell nor Hunspell offers any way to get this information (about the character classes of dictionaries) via their APIs.
>
> They provide this information in the dictionaries, and we glean it
> from there. See ispell-parse-hunspell-affix-file and
> ispell-aspell-find-dictionary.
>
> The dictionaries are not part of the API (even where the format is documented, the location may not be fixed), so it's not a good idea to rely on them.

If there's no better way, then I see no problem in relying on the
dictionaries, and de facto the results are satisfactory.

Agreed.

> Having discovered that Aspell does not provide this information (I checked again, and ispell-aspell-find-dictionary does not find this information in the dictionaries, except for limited information about otherchars; for casechars and not-casechars it defaults to [:alpha:]), I shall investigate with the hunspell maintainers.

Aspell provides some of that, and there's no reason to ignore what it
does provide.

Agreed.

Whether it's good enough depends on the dictionary and on what "(XP)"
means. It could be that "(XP)", including the parentheses, is a word
the dictionary recognizes, something akin to "(C)", i.e. copyright
sign.

Thanks, that's a good example.

So, if "(C)" is in the dictionary, then with [:graph:] as casechars, if I run ispell-word with point anywhere in "(C)", Emacs will send "(C)", and it will come back as correct. If casechars were only [:alpha:], then Emacs would send "C", and it would come back as wrong.

Conversely, if "C" is in the dictionary and I run ispell-word with casechars set to [:graph:], Emacs will send "(C)", and it will come back as correct (because Hunspell will ignore the characters that are not in its wordchars). It would also work with casechars set to [:alpha:].

So with casechars set to [:graph:], there is neither a false positive nor a false negative in either case.
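
The two cases above can be checked with a toy model. This is only a sketch in Python, not Emacs or Hunspell code: `word_at_point`, `toy_check`, `ALPHA` and `GRAPH` are hypothetical names, the character sets are ASCII stand-ins for [:alpha:] and [:graph:], and the "ignore characters outside wordchars" behaviour is the assumption stated in the paragraph above, not something verified against Hunspell.

```python
import string

# ASCII stand-ins for the character classes (an assumption of this sketch).
ALPHA = set(string.ascii_letters)                        # roughly [:alpha:]
GRAPH = set(string.printable) - set(string.whitespace)   # roughly [:graph:]


def word_at_point(text, point, casechars):
    """Return the maximal run of casechars that covers index `point`."""
    if text[point] not in casechars:
        return ""
    start = point
    while start > 0 and text[start - 1] in casechars:
        start -= 1
    end = point + 1
    while end < len(text) and text[end] in casechars:
        end += 1
    return text[start:end]


def toy_check(word, dictionary, wordchars):
    """Toy speller: accept the word as sent, or after dropping every
    character that is not in wordchars (the assumed Hunspell behaviour)."""
    if word in dictionary:
        return True
    stripped = "".join(ch for ch in word if ch in wordchars)
    return stripped in dictionary


text = "see (C) here"
point = text.index("(") + 1  # point is on the "C"

for dictionary in ({"(C)"}, {"C"}):
    for name, casechars in (("graph", GRAPH), ("alpha", ALPHA)):
        sent = word_at_point(text, point, casechars)
        ok = toy_check(sent, dictionary, ALPHA)
        print(sorted(dictionary), name, repr(sent), ok)
```

In this model the only combination that is rejected is the dictionary containing "(C)" with the [:alpha:] stand-in as casechars, which matches the reasoning above.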

I don't see why it would be fragile with Enchant when it isn't with
its back-ends.

Because there's no guarantee that Enchant will continue to use backends in the same way as at present.

And avoiding even fragile methods is worse than using
them, when there's no better way of gleaning the same information, and
the information is important (as it is in this case).

Agreed.

I think you are drawing too radical conclusions from trying that with
a single word and a single dictionary. Which string was sent to the
speller in this case,

"(XP)"

and is that the string you expected to be sent?

I don't have strong feelings about that.

> Moreover, even when we send entire lines to the speller, we want to
> skip lines that include only non-word characters.
>
> Why?

To avoid false positives and false negatives, as explained above.

But such characters will be ignored by the spellchecker (unless perhaps they occur in the personal word list). So I'm not sure how they would generate false positives or negatives.

First, Enchant could be using Hunspell as its engine, right?

Sure.

And second, AFAIU this discussion started by you proposing to get rid
of CASECHARS etc., for all spellers, not just for Enchant, something
that will definitely cause degradation.

I didn't mean to propose that. I'm sorry if I gave that impression. I'm just saying I don't want to put in the work now to add that support for Enchant. I have not changed (and do not propose to change) the support for Hunspell.

It sounds like the important part of our disagreement is in the last
sentence. If so, I hope I've succeeded in changing your mind. Failing
that, all I can suggest is to study the spelling rules of a modern
speller, such as Hunspell, and see how this information is used there.

As I already said, Hunspell does not provide this information to applications. So consumers of Hunspell have two choices:

1. Use side channels (as Emacs does).

2. Have some arbitrary idea of what constitutes a word.

The fact that an API to get the wordchars from hunspell is only now being considered for addition suggests to me that neither the maintainers of hunspell nor the developers of hunspell-using programs have thought this was particularly important.
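
To make the "side channel" option concrete, here is a minimal sketch in Python of reading the WORDCHARS directive straight out of a Hunspell affix file's text. It is not the code Emacs uses (that is ispell-parse-hunspell-affix-file, mentioned earlier in this thread); `wordchars_from_aff` and the sample affix text are hypothetical, and a real parser would also have to locate the file and honour its SET encoding.

```python
def wordchars_from_aff(aff_text):
    """Return the argument of the first WORDCHARS line, or None if the
    affix text does not declare one."""
    for line in aff_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "WORDCHARS":
            return fields[1]
    return None


# Hypothetical affix-file fragment, for illustration only.
sample_aff = """SET UTF-8
TRY esianrtolcdugmphbyfvkwz
WORDCHARS 0123456789'
"""

print(wordchars_from_aff(sample_aff))
print(wordchars_from_aff("SET UTF-8\n"))
```

The fragility discussed above is visible even here: the function depends on the file format and on the caller knowing where the file lives, neither of which is part of the Hunspell API.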

I tried to explain that above: you will get false positives or
negatives and/or irrelevant or missing corrections from the speller.
For example, if you send "foo.bar", and the speller doesn't support
'.' as a word-constituent character, you will get separate suggestions
for "foo" and "bar", and won't get "foobar".

What happens at the moment (with my Enchant patch) is that I get the error "Ispell and its process have different character maps". I wouldn't expect "foobar" in any case if "." is not a constituent character. I might be surprised to get a correction for a word I thought I wasn't pointing at, but I could be surprised in this way in any case, if the dictionary has a surprising set of wordchars.
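
The "foo.bar" example comes down to how the text is split into words. A toy illustration in Python (not how Emacs or any speller actually tokenizes; `split_words` and the character sets are hypothetical):

```python
import string


def split_words(text, wordchars):
    """Split text into maximal runs of characters drawn from wordchars."""
    words, current = [], []
    for ch in text:
        if ch in wordchars:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


letters = set(string.ascii_letters)

# '.' is not a word constituent: two separate words reach the speller.
print(split_words("foo.bar", letters))

# '.' is a word constituent: the whole string is one word.
print(split_words("foo.bar", letters | {"."}))
```

If the two sides disagree about whether '.' belongs to a word, one of them sees two words and the other sees one, so a correction may be offered for something other than the word the user thought was at point.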

I also don't understand why you want to remove this information, which
is already there and is no harder to get with Enchant than it is
without it, when the code which supports it is already there.

I'm not proposing to remove this information. I am proposing not to add it for Enchant yet (because that will require extra work and code), and I am hoping to end up with a simpler way to get it, via the API.

http://rrt.sc3d.org

From:	Reuben Thomas
Subject:	bug#17742: Acknowledgement (Support for enchant?)
Date:	Tue, 20 Dec 2016 21:43:32 +0000