From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] GB2312 incompatible with GB18030; violation of GB 18030 "principles"
Date: Thu, 29 Sep 2016 13:02:54 +0200
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )

Hello,
Mingye Wang wrote:
> I am not sure if someone has brought this up before, as what I am
> reporting is, in fact, a well-documented issue. [1]
> [1]: https://en.wikipedia.org/wiki/GB_2312#Two_implementations_of_GB2312
>
> iconv encodes the GB code points A1A4 and A1AA differently for GB 2312
> and GB 18030:
>
> bytes gb2312 gb18030
> ----- ------ -------
> A1A4 U+00B7 U+30FB
> A1AA U+2014 U+2015
>
> This slight difference breaks compatibility between these two encodings,
> a principle of the mandatory GB 18030[^1] standard:
> [^1]: -2000 and -2005. In 2000 it says "de facto internal encoding".
1) I cannot reproduce what you describe. Regardless of whether 'iconv' refers
to GNU libiconv's iconv binary or GNU libc's /usr/bin/iconv binary, I get
  $ printf '\xa1\xa4\xa1\xaa' | iconv -f GBK -t UCS-4BE | hd
  000000 00 00 00 B7 00 00 20 14 ...... .
  $ printf '\xa1\xa4\xa1\xaa' | iconv -f GB18030 -t UCS-4BE | hd
  000000 00 00 00 B7 00 00 20 14 ...... .
  $ printf '\x00\x00\x20\x15' | iconv -f UCS-4BE -t GBK | hd
  000000 A8 44 .D
  $ printf '\x00\x00\x20\x15' | iconv -f UCS-4BE -t GB18030 | hd
  000000 A8 44 .D
where 'hd' is
  hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" " 16/1 "%_p" "\n"'
Can you please show an invocation of the iconv program or a call to the
iconv() function that exhibits the problem you mean?
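A report of that kind could be produced with a small wrapper like the one
below. It is only a sketch: the function name decode_hex is made up for
illustration, it uses od rather than the 'hd' alias above, and it writes the
input bytes as octal escapes (\241\244 = A1 A4, \241\252 = A1 AA) because not
every printf understands \x. The output for each encoding depends on the iconv
implementation installed, so none is shown here.

```shell
#!/bin/sh
# decode_hex ENCODING BYTES
#   ENCODING: source encoding name passed to iconv -f
#   BYTES:    input bytes as printf octal escapes, e.g. '\241\244'
# Prints the UCS-4BE result as space-separated hex bytes on one line.
# (decode_hex is a hypothetical helper, not part of iconv.)
decode_hex() {
  printf "$2" | iconv -f "$1" -t UCS-4BE | od -An -tx1 | tr -s ' \n' ' '
}

# Decode the two disputed code points under each encoding and print
# the results side by side for comparison.
for enc in GB2312 GBK GB18030; do
  printf '%-8s' "$enc"
  decode_hex "$enc" '\241\244\241\252'
  echo
done
```

Each output line starts with the encoding name, followed by the bytes of the
two decoded code points, four bytes per code point.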
2) Even if iconv behaved the way you describe, it would not be a problem.
Reasons:
* GB2312 (1980) has been superseded by GBK and GB18030.
(Cited from https://en.wikipedia.org/wiki/GB_2312)
* GB2312 as a standard and GBK as a specification don't specify
an encoding table; they only specify characters/glyphs. GB18030
is the first standard in this area that also specifies an encoding table.
When the standard says that it is backward compatible with GB2312,
it means that at a specified code point you will find the same
character or glyph as described in the (printed!) tables from GB2312.
Therefore the way some software maps GB2312 to Unicode code points is
irrelevant for GB18030. Such mappings have historically differed between vendors
(see http://haible.de/bruno/charsets/conversion-tables/GB2312.html), and
one of the points of the GB18030 standard is that it leaves legacy encoding
problems behind - freeing itself from the baggage of the past.
Regards,
Bruno