Re: Ispell and unibyte characters

emacs-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Ispell and unibyte characters

From:	Agustin Martin
Subject:	Re: Ispell and unibyte characters
Date:	Fri, 20 Apr 2012 17:25:32 +0200
User-agent:	Mutt/1.5.21 (2010-09-15)

On Sun, Apr 15, 2012 at 10:40:29PM -0400, Stefan Monnier wrote:
> > Imagine Catalan dictionary with iso-8859-1 "·" in otherchars and other
> > dictionary (I am guessing the possibility to be more general, do not
> > actually have a real example of something different from our Debian
> > file with all info put together) with another upper char in
> > otherchars, but in a different encoding (e.g., koi8r).
> 
> You're still living in Emacs-21/22: since Emacs-23, basically chars aren't
> associated with their encoding (actually charset) any more.

Not once inside Emacs, but when reading a file, encoding matters and in
some corner cases with mixed charsets Emacs may get wrong chars
(with no sane way to make it automatically get the right ones), see
attached file.

BTW, I do not even have Emacs-21/22 installed, I am testing this in
Emacs23/24 (together with XEmacs to check that I do not introduce
additional incompatibilities)

> > The only possibility to have both coexist as chars in the same file is
> > to use multibyte UTF-8 chars instead of mixed unibyte iso-8859-1 and
> > koi8r, so Emacs properly gets chars when reading the file (if properly
> > guessing file coding-system).
> 
> Not at all, there are many encodings which cover the superset of
> iso-8859-* and koi8-*.  UTF-8 is the more fashionable one nowadays, but
> not anywhere close to the only one. e.g. there's also iso-2022,
> emacs-mule, and then some.

Sorry, should have written something like supersets, put UTF-8 as an example.

> > I'd however use this only in personal ~/.emacs files and if needed.
> 
> Why?  It would make the code more clear and simpler.

To make my Debian changes minimal I prefer to keep compatibility with
XEmacs when possible. That makes my life easier when adapting changes in FSF
Emacs repo to Debian. Seems that XEmacs has very recently added support for
automatic on-the-fly UTF-8 parsing, so my POV may change, but I admit I am
currently biassed to the 7bit \xxx strings. 

Since Emacs should now (for some days) use [:alpha:] in "Casechars" and
"Not-Casechars" for global dicts, I think we should not worry very much about
this from Emacs side, just for Otherchars in the very few cases it contains
an upper char (none in current ispell.el). And for that I still personally
prefer keep using for now the 7bit string "\xxx".

> > That is true for files with a single encoding.  However, the problem
> > happens when a file has mixed encodings like in the Debian example I
> > mentioned.  I know, this will not happen in real manually edited files,
> > but can happen and happens in aggregates like the one I mentioned.
> 
> That's an old solved problem.

May be we are speaking about different things, but as I understand this, it
does not seem so. And I do not think this can be solved in a robust enough
way for all files. See attached file and comments below.

> > If file is loaded with a given coding-system-for-read chars in that
> > coding-system will be properly interpreted by Emacs when reading, but
> > not the others. Something like that happened with
> > iso-8859-1/iso-8859-15 chars in
> 
> That was then.  Not any more.

I think you mean that iso-8859-* chars are currently unified. I am aware of
that, but I am speaking about something different, mixed encodings, also
discussed in that thread together with the iso-8859-1/iso-8859-15 problems.

See attached file. It contains middledot in two encodings, UTF-8 in first
line and latin1 in the second, together with something that was originally
written as iso-8859-7 lowercase greek zeta. In my iso-8859-1 box emacs24
(emacs-snapshot_20120410) reads it as

--
Â·
·
æ
--

so gets the wrong char both for UTF-8 and for greek lowercase zeta. In a
different environment Emacs may have guessed that first line is UTF-8, but
I do not see a robust enough to properly guess all the mixed encodings for
a small file like this.

That is the kind of things I am now dealing with for Debian. Currently not a
big problem for Emacs after [:alpha:] changes for casechars/not casechars 
(the chance that a new dict adds otherchars in incompatible charsets is
small), however this can still happen us for XEmacs in that aggregated
file.

Sorry if I did not make that clear enough and helped making this thread this
long.

Regards,

-- 
Agustin

test.txt
Description: Text document

[Prev in Thread]

Current Thread

[Next in Thread]

Re: Ispell and unibyte characters, (continued)
- Re: Ispell and unibyte characters, Eli Zaretskii, 2012/04/26

Prev by Date: Re: bootstrap w/gcc-4.8.0 + nonzero MALLOC_PERTURB_ -> ./temacs segfault
Next by Date: Re: flyspell.el and non-word characters in CASECHARS
Previous by thread: Re: Ispell and unibyte characters
Next by thread: Re: Ispell and unibyte characters
Index(es):
- Date
- Thread