bug#17742: Acknowledgement (Support for enchant?)

bug-gnu-emacs

From:	Eli Zaretskii
Subject:	bug#17742: Acknowledgement (Support for enchant?)
Date:	Mon, 19 Dec 2016 18:01:27 +0200

> From: Reuben Thomas <rrt@sc3d.org>
> Date: Sun, 18 Dec 2016 23:39:54 +0000
> Cc: 17742@debbugs.gnu.org
> 
> I have not had any response to my enquiries yet, but I did some research, and 
> neither GNU Aspell nor hunspell offer any way to get this information (about 
> character classes of dictionaries) via their APIs.

They provide this information in the dictionaries, and we glean it
from there.  See ispell-parse-hunspell-affix-file and
ispell-aspell-find-dictionary.

> This suggests that they do not see a need for it. Perhaps it is worth 
> confirming whether Emacs really needs this information?
> 
> As far as I can see, it is used only in flyspell-word, for per-word 
> spell-checking. (The only caller outside flyspell.el is erc, which has a 
> FIXME saying not to call flyspell-word.) As far as I can see, the code 
> assumes that words are a convenient unit to check and cache, though there's 
> no definite requirement for that: in particular, the spelling checkers will 
> say what words are incorrectly spelled and where they are without having to 
> be given precisely the word. I guess that other editors and word processors 
> work this way.

Maybe there's a misunderstanding: I'm talking about the CASECHARS,
NOT-CASECHARS, and OTHERCHARS parts of the dictionary data in
ispell-dictionary-alist.  These are definitely used in ispell.el, via
the corresponding accessor functions ispell-get-casechars,
ispell-get-not-casechars, and ispell-get-otherchars, which see.

Each dictionary can (and many do) use some of the punctuation
characters in the words it can handle.  A notable example is the
apostrophe ' in English, used for the various suffixes that spellers
support; similar features exist in other languages, but with possibly
different punctuation characters.  Ispell.el must match that by using
the speller's notion of a word, which must be independent of the
current major mode's idea of what a word is.  This is where these
character sets come into play, and I really cannot see how ispell.el
can work well without using them as it does now.
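
To illustrate the point, here is a minimal sketch of how a word regexp can be
built from a dictionary's CASECHARS and OTHERCHARS.  It is a simplification
for illustration, not the code in ispell.el; the function name
sketch-word-at and the sample values are hypothetical.

```elisp
;; Hypothetical helper, not part of ispell.el.
(defun sketch-word-at (text start casechars otherchars)
  "Return the first word in TEXT at or after START, or nil.
CASECHARS and OTHERCHARS are bracket expressions in the style of an
`ispell-dictionary-alist' entry.  OTHERCHARS may be nil."
  (let ((re (if otherchars
                ;; Letters, optionally joined by single OTHERCHARS.
                (concat casechars "+\\(" otherchars casechars "+\\)*")
              (concat casechars "+"))))
    (when (string-match re text start)
      (match-string 0 text))))

;; With the apostrophe in OTHERCHARS the speller sees one word:
(sketch-word-at "the dog's bone" 4 "[[:alpha:]]" "[']")  ; "dog's"
;; Without it, the word stops at the apostrophe:
(sketch-word-at "the dog's bone" 4 "[[:alpha:]]" nil)    ; "dog"
```

Note that nothing here consults the buffer's syntax table, so the result is
the same in every major mode.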

So we do need this information.  If Enchant doesn't provide it, we
could still use the same technique as with Aspell and Hunspell,
provided that we can figure out which back end(s) is/are used by
Enchant.  Is that doable?

> For example, aspell.h contains the following notice about 
> aspell_document_checker_process:
> 
>  * The string passed in should only be split on
>  * white space characters.

Ispell.el also supports spell-checking by words, in which case the
above is not useful, because we need to figure out what is a word.
Moreover, even when we send entire lines to the speller, we want to
skip lines that include only non-word characters.  Just look at the
callers of the above-mentioned accessor functions, and you will see
how we use them.
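
As a rough sketch of the line-skipping case (the helper name
sketch-line-has-word-p is hypothetical, and the real code in ispell.el is
more involved), the test amounts to asking whether a line contains at least
one CASECHARS character:

```elisp
;; Hypothetical helper, not part of ispell.el.
(defun sketch-line-has-word-p (line casechars)
  "Return t if LINE has at least one character matching CASECHARS."
  (and (string-match casechars line) t))

(sketch-line-has-word-p "-- foo --" "[[:alpha:]]")        ; t
(sketch-line-has-word-p "----- *** -----" "[[:alpha:]]")  ; nil
```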

> Basic tests using [[:alpha:]] for casechars and [^[:alpha:]] for 
> not-casechars seem to work OK.

For which language and dictionary?  This will definitely do the wrong
thing for the Hunspell he_IL dictionary I have here, which says:

  WORDCHARS אבגדהוזחטיכלמנסעפצקרשתםןךףץ'"

That is, it wants ' and " to be treated as word-constituent
characters.  As another example, I can envision a dictionary of
acronyms and abbreviations, which might want to treat the period as a
word-constituent character, to support the likes of "a.k.a.".
Etc. etc. -- this is up to the dictionary to decide, and Emacs must
follow suit.
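
A minimal sketch of gleaning this from the affix file text follows.  It is
not the implementation of ispell-parse-hunspell-affix-file; the function name
and the sample affix text are made up for illustration.  It collects the
WORDCHARS characters that [[:alpha:]] alone would miss:

```elisp
;; Hypothetical helper, not part of ispell.el.
(defun sketch-extra-wordchars (affix-text)
  "Return the non-alphabetic characters on the WORDCHARS line of AFFIX-TEXT.
Return nil if AFFIX-TEXT has no WORDCHARS line."
  (when (string-match "^WORDCHARS[ \t]+\\(.+\\)$" affix-text)
    (let ((chars (match-string 1 affix-text))
          (others nil))
      (dolist (ch (string-to-list chars))
        ;; Keep only what a plain [[:alpha:]] class would not cover.
        (unless (string-match "[[:alpha:]]" (string ch))
          (push ch others)))
      (nreverse others))))

;; Made-up affix text; the result is the list of ?' and ?\" characters.
(sketch-extra-wordchars "SET UTF-8\nWORDCHARS ab'\"\nTRY esi\n")
```

The characters returned are exactly the ones that would have to end up in
OTHERCHARS for the dictionary's words to be recognized whole.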

Also, please note that [:alpha:] in Emacs 25 means a much larger set
of characters than in previous versions (see NEWS).  It will in general
catch strings of characters that cannot possibly be TRT for a
single-language dictionary.  E.g.,

  (string-match "[[:alpha:]]+" "aβגд") => 0

> I meant [[:graph:]] and [^[:graph:]].

This will match an even larger set in Emacs 25; I don't think we will
ever want that for spell-checking.

> Also, as I realised while preparing the patch for bug#25230, it is only 
> hunspell that has special information
> about character classes. All the others just use [:alpha:]. So if it's good 
> enough for ispell and aspell, can't it be
> good enough for enchant? (It just means that for now "direct Hunspell" is 
> arguably better than "Hunspell via
> Enchant".)

Hunspell is the most modern and sophisticated speller; we certainly
don't want to degrade it.  Also, Aspell uses the dictionaries for at
least some of this info; see ispell-aspell-find-dictionary, which I
pointed to above.

Once again, if Enchant uses a back-end for which we know how to find
this information, we should do so.

Bottom line, this information cannot be thrown away or ignored.  It is
important for correctly interfacing with a dictionary and for doing
TRT as the users expect.  Any modern speller program would benefit
from it, and therefore we should strive to provide such information to
ispell.el whenever we possibly can.

Thanks.
