Re: [bug-gnu-libiconv] GB2312 incompatible with GB18030; violation of GB
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] GB2312 incompatible with GB18030; violation of GB 18030 "principles"
Date: Thu, 29 Sep 2016 16:41:57 +0200
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )
Hi Mingye Wang,
> > 1) I can not reproduce what you mean here. Regardless whether 'iconv' refers
> > to GNU libiconv's iconv binary or GNU libc's /usr/bin/iconv binary, I get
> >
> > $ printf '\xa1\xa4\xa1\xaa' | iconv -f GBK -t UCS-4BE | hd
> > 000000 00 00 00 B7 00 00 20 14 ...... .
> > $ printf '\xa1\xa4\xa1\xaa' | iconv -f GB18030 -t UCS-4BE | hd
> > 000000 00 00 00 B7 00 00 20 14 ...... .
> >
> > Please, can you show an application of the iconv program or iconv()
> > function that exhibits the problem you mean?
>
> Well, I mean GB 2312, not GBK.
>
> $ printf '\xA1\xA4\xA1\xAA' | iconv -f gb2312 -t UCS-4BE | hd
> 000000 00 00 30 FB 00 00 20 15 ..0... .
OK, I can reproduce this. But it's really ultra-minor: given that
GB2312 was defined by printing a set of characters on paper, and that in
print you cannot see the difference between a 'MIDDLE DOT' and a
'KATAKANA MIDDLE DOT', nor the difference between an 'EM DASH' and a
'HORIZONTAL BAR', these differences are semantically irrelevant.
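The same comparison can be sketched without the iconv binary. The snippet below is only an illustration: it assumes CPython's bundled `gb2312`, `gbk` and `gb18030` codecs, which are separate implementations from GNU libiconv and glibc and need not agree with them in every detail. The helper name `code_points` is made up for this example.

```python
# Decode the two byte pairs discussed in this thread with three codecs
# and list the resulting Unicode code points.
# Assumption: CPython's built-in CJK codecs, not libiconv or glibc.
DATA = b"\xa1\xa4\xa1\xaa"


def code_points(data, codec):
    """Return the decoded characters as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(codec)]


if __name__ == "__main__":
    for codec in ("gb2312", "gbk", "gb18030"):
        print(f"{codec:8s}", " ".join(code_points(DATA, codec)))
```

If the codecs follow the same tables as the transcripts above, the `gb2312` line should differ from the other two in both positions.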
They are also practically irrelevant because GB2312 = EUC-CN was the Unix
encoding for Simplified Chinese until ca. 2000. Since then, GBK (pushed
by Microsoft) and GB18030 have taken over (besides Unicode, of course).
Not many people still have files from that era around.
> Similarly, the GBK encoder does not recognize U+30FB.
Microsoft decided that CP936 should not recognize U+30FB, and
published this mapping table under the name 'GBK'.
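The encoder side can be probed in the same way. This is a rough sketch that assumes CPython's built-in codecs rather than libiconv; `can_encode` is a hypothetical helper introduced only for this example.

```python
# Check which codecs accept a given character when encoding.
# Assumption: CPython's built-in codecs; libiconv may behave differently.
KATAKANA_MIDDLE_DOT = "\u30fb"
MIDDLE_DOT = "\u00b7"


def can_encode(ch, codec):
    """Return True if `codec` can encode the character `ch`."""
    try:
        ch.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for codec in ("gb2312", "gbk"):
        for ch in (KATAKANA_MIDDLE_DOT, MIDDLE_DOT):
            print(f"{codec:8s} U+{ord(ch):04X} encodable: {can_encode(ch, codec)}")
```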
> What worries me is mainly that authors of ad hoc crawlers simply don't
> know about its existence[^1]. W3C's Encoding TR[1] specifies that a GBK
> decoder should be used for encoding=gb2312.
Thanks for the pointer to this TR.
The mapping tables for GB2312 are irrelevant: since ca. 1995..1999
the market power of Microsoft has ensured that documents used the GBK
encoding, not the older EUC-CN tables. In other words, nowadays you
consider GBK legacy, and GB2312 is legacy of legacy.
The real problem addressed by the TR is that emails and HTML pages
written in GBK encoding often carry a 'gb2312' encoding label - for
17 years already, even today. As far as I can see, this TR addresses
it correctly.
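For a crawler author, the practical consequence might look like the sketch below: treat the 'gb2312' label as an alias for a GBK decoder before decoding. The table `LABEL_TO_CODEC` and the function `decode_labeled` are hypothetical names, the alias list is deliberately partial, and Python's `gbk` codec is used here only as a stand-in for the decoder the TR specifies.

```python
# Map declared encoding labels to the codec actually used for decoding.
# Assumption: a deliberately partial alias table; the TR lists more labels.
LABEL_TO_CODEC = {
    "gb2312": "gbk",  # pages labelled gb2312 are often really GBK
    "gbk": "gbk",
    "x-gbk": "gbk",
}


def decode_labeled(data, label):
    """Decode `data` using the codec that the declared `label` resolves to."""
    key = label.strip().lower()
    codec = LABEL_TO_CODEC.get(key, key)  # unknown labels pass through as-is
    return data.decode(codec)


if __name__ == "__main__":
    # 0x81 0x40 is a GBK-only sequence; a strict GB2312 decoder rejects it.
    text = decode_labeled(b"\x81\x40\xa1\xa4", "GB2312")
    print([f"U+{ord(ch):04X}" for ch in text])
```

Resolving the label first keeps the decoding code unaware of the mislabeling, which is the same separation the TR makes between labels and encodings.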
> One thing to note, still, is that the GB tables are gone -- not even
> found in ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA. Since the
> official source is now gone, I am not sure if I should trust this
> specific mapping any more.
Well, it is precisely to answer this question about "trust" in a mapping
table that I created these comparison web pages:
http://haible.de/bruno/charsets/conversion-tables/
Bruno