Re: [bug-gnu-libiconv] Updating iconv tables
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Updating iconv tables
Date: Fri, 13 Jun 2008 01:59:16 +0200
User-agent: KMail/1.5.4
Dear Jim Breen,
> > You have a misconception of what EUC-JP is. EUC-JP is a character encoding
> > scheme based on three standards: ASCII, JIS X 0208, and JIS X 0212. These
> > are standards issued by Japanese authorities, and carved in stone. Anyone
> > who thinks that EUC-JP tables have to be "kept up-to-date", is asking for
> > deviation from standards, and is asking for interoperability problems!
>
> You are out-of-date there. EUC-JP also includes JIS X 0213 ...
Wrong. The ultimate reference (standard) for character sets and their
definition, regarding their practical use, is the IANA character set registry:
http://www.iana.org/assignments/character-sets
It says that EUC-JP is composed of
  code set 0: US-ASCII (a single 7-bit byte set)
  code set 1: JIS X0208-1990 (a double 8-bit byte set)
              restricted to A0-FF in both bytes
  code set 2: Half Width Katakana (a single 7-bit byte set)
              requiring SS2 as the character prefix
  code set 3: JIS X0212-1990 (a double 7-bit byte set)
              restricted to A0-FF in both bytes
              requiring SS3 as the character prefix
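The code-set layout above can be illustrated with a small byte classifier. This is only a sketch under stated assumptions: the helper name `split_euc_jp` is made up for illustration, SS2 and SS3 are taken to be the single-shift bytes 0x8E and 0x8F, and the A0-FF range is copied from the registry text as quoted (actual tables populate only part of that range). It checks byte structure only, not whether a code point is assigned.

```python
SS2 = 0x8E  # single shift 2: prefix for code set 2 (half-width katakana)
SS3 = 0x8F  # single shift 3: prefix for code set 3 (JIS X 0212)


def split_euc_jp(data: bytes) -> list:
    """Split an EUC-JP byte string into (code set, raw bytes) pairs.

    Hypothetical helper: validates the byte structure only, not whether
    each code point is actually assigned in the underlying standard.
    """
    result = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            code_set, length = 0, 1      # US-ASCII, one byte
        elif lead == SS2:
            code_set, length = 2, 2      # SS2 + one byte
        elif lead == SS3:
            code_set, length = 3, 3      # SS3 + two bytes
        elif lead >= 0xA0:
            code_set, length = 1, 2      # JIS X 0208, two bytes
        else:
            raise ValueError(f"unexpected lead byte 0x{lead:02X} at offset {i}")
        chunk = data[i:i + length]
        # Every byte after the lead byte must lie in the A0-FF range.
        if len(chunk) < length or any(b < 0xA0 for b in chunk[1:]):
            raise ValueError(f"malformed sequence at offset {i}")
        result.append((code_set, chunk))
        i += length
    return result


if __name__ == "__main__":
    sample = b"A\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
    for code_set, chunk in split_euc_jp(sample):
        print(code_set, chunk.hex())
```

Note that nothing in this structure leaves room for a fifth character set: a two-byte sequence without a prefix is always interpreted through code set 1.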
> The codepoint I raised arrived in JIS X 0213.
> See: http://en.wikipedia.org/wiki/JIS_X_0213 for an overview.
This page refers to http://en.wikipedia.org/wiki/EUC
which says that the encoding that looks like EUC-JP but uses JIS X 0213
is called EUC-JISX0213.
And indeed the character that you meant to show me (bytes 0xAD 0xEA)
in EUC-JISX0213 is U+3231. In EUC-JISX0213, but not in EUC-JP.
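One way to check this for a given implementation is to decode the two bytes under both encoding names. The sketch below uses Python's built-in `euc_jp` and `euc_jisx0213` codecs as a stand-in for iconv; whether a particular `euc_jp` implementation rejects the bytes depends on its tables, so no result is claimed for that case. The helper `try_decode` is hypothetical.

```python
def try_decode(raw: bytes, encoding: str):
    """Return the decoded string, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    raw = bytes([0xAD, 0xEA])
    for name in ("euc_jp", "euc_jisx0213"):
        text = try_decode(raw, name)
        if text is None:
            print(f"{name}: not decodable")
        else:
            print(f"{name}:", " ".join(f"U+{ord(ch):04X}" for ch in text))
```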
> You can think of JIS X 0213 as an enhancement/replacement for JIS X 0208.
In the same sense, you can "think of" EUC-JISX0213 as an enhancement of
EUC-JP. But this "enhancement" has two caveats:
1) Compared to EUC-JP, EUC-JISX0213 removes 6068 code points, and adds
   4355 code points instead. It is by no means an "enhancement" to drop
   more than 1000 characters!
2) EUC-JISX0213 can be used via 'iconv', but cannot be used as a locale
   encoding in glibc based systems. This is because glibc has chosen to
   use Unicode characters as 'wchar_t' representation, and there are some
   characters in JISX0213 which don't map 1:1 to Unicode (rather 1:2,
   requiring the use of combining Unicode characters).
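The 1:2 mapping mentioned in point 2 can be made visible with a short sketch. It assumes that the byte pair 0xA4 0xF7 (JIS X 0213 plane 1, row 4, cell 87) is one of the affected characters and that Python's `euc_jisx0213` codec follows the same mapping; both are assumptions for illustration, not something stated in this thread.

```python
import unicodedata

# Assumed example: one EUC-JISX0213 character, encoded in two bytes.
raw = bytes([0xA4, 0xF7])
text = raw.decode("euc_jisx0213")

# Print each resulting Unicode code point with its combining class;
# a non-zero class marks a combining character.
for ch in text:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "?"),
        "combining class",
        unicodedata.combining(ch),
    )
print("code points produced:", len(text))
```

A fixed-width 'wchar_t' that holds one Unicode code point cannot represent such a character as a single unit, which is the locale problem described above.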
> Of course EUC-JP tables need to be kept up-to-date.
There is nothing to keep up-to-date. EUC-JP is based on JISX 0208 and JISX 0212.
JISX 0213 is not a new version of JISX 0208 or JISX 0212, it is a new and
*different* standard. Therefore in glibc we call it EUC-JISX0213.
> > Take a look at
> > http://www.haible.de/bruno/charsets/conversion-tables/EUC-JP.html
> > to see how many variants of EUC-JP already exist!
>
> Sadly your WWW page omits any mention of JIS X 0213.
That's because EUC-JISX0213 is not even remotely backward compatible with
EUC-JP.
Look at
http://www.haible.de/bruno/charsets/conversion-tables/Japanese.html
> Sun has simply kept up with the developments in Japanese coding. These are
> *not* vendor extensions.
I don't know what Sun did. But if they were providing EUC-JISX0213 under the
name "EUC-JP", that would be a very bad (because not standards compliant) move.
> In case you think I am talking through my hat, I must point out that I am
> one of only a handful of non-Japanese people who have participated in the
> development of the Japanese standards.
Oh, you are arguing by intimidation? Then I have to point out that I have
contributed implementations of EUC-JISX0213 and SHIFT_JISX0213 to GNU libc
and GNU libiconv in 2002, before any other vendor's iconv had it.
> I am happy to work with you in getting the full set of current Japanese
> codes into iconv. As it stands at the moment, the GNU issue does not
> adequately handle all the standard Japanese codes.
As it stands at the moment, GNU libc and GNU libiconv have all the standard
Japanese encodings; only you confused the names.
Bruno
PS: I have no idea in which encoding your EDICT dictionary now actually is.
If you started out writing it in EUC-JP and at some point switched to
using EUC-JISX0213, you may have dozens of entries which are correct in
EUC-JP but wrong in EUC-JISX0213, and dozens of entries for which it is
the opposite.