Re: Ispell and unibyte characters

emacs-devel


From:	Agustin Martin
Subject:	Re: Ispell and unibyte characters
Date:	Mon, 26 Mar 2012 19:39:12 +0200
User-agent:	Mutt/1.5.21 (2010-09-15)

On Sat, Mar 17, 2012 at 08:46:54PM +0200, Eli Zaretskii wrote:
> The doc string of ispell-dictionary-alist says, inter alia:
> 
>   Each element of this list is also a list:
> 
>   (DICTIONARY-NAME CASECHARS NOT-CASECHARS OTHERCHARS MANY-OTHERCHARS-P
>         ISPELL-ARGS EXTENDED-CHARACTER-MODE CHARACTER-SET)
>   ...
>   CASECHARS, NOT-CASECHARS, and OTHERCHARS must be unibyte strings
>   containing bytes of CHARACTER-SET.  In addition, if they contain
>   a non-ASCII byte, the regular expression must be a single
>   `character set' construct that doesn't specify a character range
>   for non-ASCII bytes.
> 
> Why the restriction to unibyte character sets?  This is quite a
> serious limitation, given that the modern spellers (aspell and
> hunspell) use UTF-8 as their default encoding.

Hi Eli,

At least for aspell, ispell.el already uses UTF-8 as the default communication
encoding and [:alpha:] as CASECHARS (and ^[:alpha:] as NOT-CASECHARS).
OTHERCHARS is guessed from the aspell .dat file for the given dictionary.

Since it is currently not possible to ask hunspell for its installed
dictionaries (hunspell -D does not return control to the console),
no one has tried something similar for hunspell.

> The only reason for this limitation I could find is in
> ispell-process-line, which assumes that the byte offsets returned by
> the speller can be used to compute character position of the
> misspelled word in the buffer.  Are there any other places in
> ispell.el that assume unibyte characters?

I am not sure whether using UTF-8 and [:alpha:] has caused any problems for
aspell; I do not remember any reports about this.
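
To illustrate the offset issue from the quoted paragraph: once the line sent
to the speller is UTF-8, an offset counted in bytes no longer equals the
character position when non-ASCII characters precede the word. The following
is only a minimal sketch in Python, not code from ispell.el; the function
name is hypothetical, and it assumes the byte offset falls on a character
boundary.

```python
def byte_offset_to_char_offset(line: str, byte_offset: int,
                               encoding: str = "utf-8") -> int:
    """Convert a byte offset into the encoded LINE to a character offset.

    Hypothetical helper for illustration; assumes BYTE_OFFSET lies on a
    character boundary in the given encoding.
    """
    encoded = line.encode(encoding)
    # Decode only the bytes before the offset and count the characters.
    return len(encoded[:byte_offset].decode(encoding))


line = "señor wrrd"
# Offset of the misspelled word as a byte-oriented program would see it.
byte_pos = line.encode("utf-8").index(b"wrrd")
# Offset of the same word counted in characters.
char_pos = byte_offset_to_char_offset(line, byte_pos)
```

With a unibyte character set such as ISO-8859-1 the two offsets coincide,
which is why the assumption is harmless there.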

> If ispell-process-line is the only place, then it should be easy to
> extend it so it handles correctly UTF-8 in addition to unibyte
> character sets.
> 
> In any case, I see no reason to specify CASECHARS, NOT-CASECHARS, and
> OTHERCHARS as ugly unibyte escapes, since their usage is entirely
> consistent with multibyte characters: they are used to construct
> regular expressions and match buffer text against those regexps.  

IIRC, the reason to use octal escapes is mostly that they are
encoding-independent. Otherwise a .emacs file may end up with mixed
unibyte/multibyte encodings.
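
A small illustration of the encoding-independence point, in Python rather
than Emacs Lisp and only as a sketch: a byte string written with octal
escapes is pure ASCII in the source file, so it yields the same bytes
whatever encoding the file is saved in. Its meaning is fixed only when the
bytes are interpreted in the dictionary's character set. The variable names
and the two character sets are illustrative assumptions.

```python
# The source text of this literal is pure ASCII thanks to the octal escapes.
casechars_bytes = b"[A-Za-z\341\351\355\363\372\361]"

# The same bytes read as two different single-byte character sets.
as_latin1 = casechars_bytes.decode("iso-8859-1")
as_koi8r = casechars_bytes.decode("koi8-r")
```

Here `as_latin1` contains the accented Latin letters a Spanish dictionary
would want, while `as_koi8r` contains Cyrillic letters for the very same
bytes.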

The current limitation in the docstring may just be a leftover from old
times. I will try to check with a recent ispell American dictionary, which
can be called in UTF-8, and will let you know.

Regards,

-- 
Agustin

Current Thread

Ispell and unibyte characters, Eli Zaretskii, 2012/03/17
- Re: Ispell and unibyte characters, Agustin Martin <=
  - Re: Ispell and unibyte characters, Eli Zaretskii, 2012/03/26
    - Re: Ispell and unibyte characters, Lennart Borgman, 2012/03/26
    - Re: Ispell and unibyte characters, Agustin Martin, 2012/03/28
    - Re: Ispell and unibyte characters, Eli Zaretskii, 2012/03/29
    - Re: Ispell and unibyte characters, Andreas Schwab, 2012/03/29
    - Re: Ispell and unibyte characters, Eli Zaretskii, 2012/03/30
