emacs-devel
Re: What exactly is chinese-big5?


From: Kenichi Handa
Subject: Re: What exactly is chinese-big5?
Date: Fri, 18 Apr 2008 10:32:15 +0900
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/23.0.60 (i686-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO)

In article <address@hidden>, Eli Zaretskii <address@hidden> writes:

> Emacs 22.2 supports the chinese-big5 encoding.  However, I cannot find
> anywhere the precise description of which flavor(s) of BIG5 is/are
> supported.  The Wiki page http://en.wikipedia.org/wiki/Big5 describes
> half a dozen of extensions to the original Big5 encoding, so it would
> be good to know which one(s) we support.

Emacs supports the full range of the Big5 code space, i.e.
   1st byte: 0xA1 .. 0xFE
   2nd byte: 0x40 .. 0x7E and 0xA1 .. 0xFE

This means that Emacs can decode every Big5 code point that
falls in the above range.  Emacs 22 pays no attention to
which code point is assigned to which Chinese character.
It has a separate character space for Big5 characters (in
charsets chinese-big5-1 and chinese-big5-2) and thus can
hold all possible characters.  Some code points may be
assigned to no character in some variants of Big5.  Emacs
22.2 simply doesn't care about that.
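
To make the range concrete, here is a minimal sketch in Python (not
Emacs code) that tests whether a byte pair falls in the code space
quoted above.  The function names are hypothetical and chosen only
for illustration.

```python
def in_big5_code_space(b1: int, b2: int) -> bool:
    """Return True if the byte pair lies in the range quoted above.

    1st byte: 0xA1 .. 0xFE
    2nd byte: 0x40 .. 0x7E or 0xA1 .. 0xFE
    """
    lead_ok = 0xA1 <= b1 <= 0xFE
    trail_ok = (0x40 <= b2 <= 0x7E) or (0xA1 <= b2 <= 0xFE)
    return lead_ok and trail_ok


def count_code_points() -> int:
    """Count every byte pair accepted by in_big5_code_space."""
    return sum(
        in_big5_code_space(b1, b2)
        for b1 in range(256)
        for b2 in range(256)
    )
```

The count is 94 lead bytes times 157 trail bytes.  Whether a given
pair is assigned to a character is a separate question that depends
on the Big5 variant, which is exactly what this check ignores.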

Emacs 23 supports all code points of Big5 as well.  It
first decodes Big5 characters into a single separate code space
(#x130000 and up).  Then it unifies most of them with Unicode
by using a charset map distributed with glibc
(/usr/share/i18n/charmaps/BIG5.gz).  So, for instance, Big5
A140 is decoded and unified to U+3000, but Big5 FEFE is just
decoded to #x134621 (outside the Unicode range).
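
As a worked example of the arithmetic, the following Python sketch
maps a Big5 code to a position in the separate code space.  The
layout is an assumption, not something stated above: it treats the
second byte as a dense range 0x40 .. 0xFE (191 slots per row),
because that layout reproduces the #x134621 figure for FEFE.  The
names BIG5_BASE and big5_to_separate_space are hypothetical.

```python
BIG5_BASE = 0x130000          # start of the separate code space
LEAD_MIN, LEAD_MAX = 0xA1, 0xFE
TRAIL_MIN, TRAIL_MAX = 0x40, 0xFE   # assumed dense, gap 0x7F..0xA0 included
ROW_WIDTH = TRAIL_MAX - TRAIL_MIN + 1  # 191 slots per lead byte


def big5_to_separate_space(code: int) -> int:
    """Linearize a two-byte Big5 code such as 0xFEFE (assumed layout)."""
    b1, b2 = code >> 8, code & 0xFF
    if not (LEAD_MIN <= b1 <= LEAD_MAX and TRAIL_MIN <= b2 <= TRAIL_MAX):
        raise ValueError(f"not a Big5 code: {code:#06x}")
    return BIG5_BASE + (b1 - LEAD_MIN) * ROW_WIDTH + (b2 - TRAIL_MIN)


if __name__ == "__main__":
    for code in (0xA140, 0xFEFE):
        print(f"{code:04X} -> #x{big5_to_separate_space(code):X}")
```

This covers only the first step.  The unification with Unicode
(A140 to U+3000) depends on the charset map and is not modeled here.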

> The specific situation where I needed to know this was when I was
> handed a file with what was supposed to be Chinese text and was asked
> to convert it to UTF-8.  detect-coding-region suggested chinese-big5
> as the only Chinese encoding for the non-ASCII characters in the file,
> so I tried that.  Interestingly enough, both `recode' and `iconv'
> refused to convert the file, no matter what flavor of Big5 (including
> cp950) I tried, but Emacs read the file with no problems and produced
> what seems like a valid UTF-8 encoding.  `iconv' 1.12 supports quite a
> few Big5 flavors, but they all choked on some characters in the file.

When you read a Big5 file containing the byte sequence FEFE and
try to write it as utf-8, Emacs 22.2 silently generates U+FFFD
(REPLACEMENT CHARACTER), as described in the docstring of the
utf-8 coding system.  So it is possible that the
file you wrote also contains that character.

On the other hand, Emacs 23 warns that only Big5 and
utf-8-emacs can encode it.

> So what exactly is chinese-big5 in Emacs, and how come it succeeds
> where the latest `iconv' fails?

As explained above.

> In particular, should I worry about
> possibly incorrect conversion by Emacs, where `iconv' barfs (the file
> is very large and I cannot proofread all the converted strings)?

In Emacs 22, you can read the written file as utf-8 and
search for U+FFFD.  In Emacs 23, you'll see the warning when
writing the file.
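
For a file too large to inspect by hand, the same search can be done
outside Emacs.  Below is a minimal Python sketch, assuming the
converted file is valid UTF-8; the function names are hypothetical.

```python
from pathlib import Path

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def find_replacement_chars(text: str) -> list[tuple[int, int]]:
    """Return (line, column) pairs for every U+FFFD in text.

    Lines are numbered from 1, columns from 0.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        col = line.find(REPLACEMENT)
        while col != -1:
            hits.append((lineno, col))
            col = line.find(REPLACEMENT, col + 1)
    return hits


def scan_file(path: str) -> list[tuple[int, int]]:
    """Read a UTF-8 file and report where U+FFFD occurs."""
    return find_replacement_chars(Path(path).read_text(encoding="utf-8"))
```

An empty result from scan_file suggests that no character was
replaced on writing.  It says nothing about whether the characters
that did convert were mapped as the file's author intended.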

---
Kenichi Handa
address@hidden



