From: Mingye Wang (Arthur2e5)
Subject: Re: [bug-gnu-libiconv] cp936, cp950, cp1252, etc. does not behave like their windows counterparts
Date: Thu, 24 Nov 2016 16:33:22 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.7.2

Hi,
Bruno Haible wrote:
> There are two implementations of 'iconv' in GNU, one in glibc, and one
> in libiconv. Here you are writing about glibc behaviour, for which you
> can report bugs in the glibc bugzilla. But I can give you some background
> anyway.
Hmm. I should find some time to forward some parts of this report (like
cp936) to glibc.
>> cp936
> You can see from http://haible.de/bruno/charsets/conversion-tables/sources.html
> that the site considers Windows versions up to October 2016.
Great collection... again.
> libiconv includes the '0x80 U+20AC' mapping for CP936; glibc doesn't.
> Maybe because this Euro sign is not contained in the GBK 1.0 standard
> (see https://en.wikipedia.org/wiki/GBK#GBK_1.0). Maybe because U+20AC
> is mapped to a different codepoint in GB18030 and GB18030 is meant to
> be an extension of GBK.
I guess these tables should be kept 'like Windows' as long as they are
referring to a Windows Code Page. It seems that glibc simply made CP936
an alias of GBK... well.
Also, confirmed working in Cygwin where `iconv` actually comes from GNU
libiconv 1.14.
cp950 has no mappings for HKSCS
-------------------------------
Reference:
http://haible.de/bruno/charsets/conversion-tables/Big5.html
> It is not a good idea to propagate arbitrary modifications of existing
> encodings, because it causes interoperability problems. You are actually
> calling it a "hack".
I am calling it a hack because MS is pushing a separate code page number
(951) to mask 950 in their Windows XP support package. Not quite the
case for later Windows releases...
> As you can see from http://haible.de/bruno/charsets/conversion-tables/Big5.html
> (search for windows-2016/CP950.TXT), on Windows 10, CP950 does *not*
> contain HKSCS extension mappings.
It seems that libiconv *does* have private area mappings for Big5's
user-defined blocks in cp950. glibc aliased CP950 to Big5, so it's going
to be their fault. Another false alarm.
On Cygwin, iconv seems to accept \x87\x40 and give \ue000, but it does
not convert \ue000 back to \x87\x40.
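
To make the "one direction only" observation easy to repeat, here is a minimal sketch of a probe. It uses Python's built-in codecs, which are neither libiconv nor glibc, so the results may well differ from what `iconv` reports; `probe_round_trip` is a hypothetical helper name, not part of any library.

```python
def probe_round_trip(raw: bytes, encoding: str) -> dict:
    """Decode raw with the given codec, then try to encode the result back.

    Assumption: this exercises Python's own codec tables, not libiconv's
    or glibc's, so it only illustrates the shape of the check.
    """
    result = {"decoded": None, "re_encoded": None, "round_trips": False}
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # The byte sequence has no mapping in this codec.
        return result
    result["decoded"] = text
    try:
        back = text.encode(encoding)
    except UnicodeEncodeError:
        # Decoding worked, but the reverse direction is missing.
        return result
    result["re_encoded"] = back
    result["round_trips"] = back == raw
    return result
```

Calling `probe_round_trip(b"\x87\x40", "cp950")` would show whether that particular converter maps the sequence at all, and whether the reverse mapping exists.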
> Likewise for the official mapping tables provided by Microsoft:
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
*todo: bug glibc for cp950 EUDC too*
>> Since it's a bidirectional conversion, this assignment is not part of
>> "best fit" behavior per [4].
> You're mistaken. The point of the "best fit" converters in Microsoft is that
> they also document the conversions that go only in one direction, i.e. that
> don't round-trip.
I thought these round-trip parts are not doing "best fit" and should be
considered (somehow) normative.
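
A minimal sketch of what I mean by the "round-trip subset", assuming a best-fit table can be reduced to a bytes-to-Unicode map and a Unicode-to-bytes map. The two small dictionaries are illustrative toy data, not copied from any Microsoft file, and `split_round_trip` is a hypothetical helper.

```python
from typing import Dict, Tuple


def split_round_trip(
    to_unicode: Dict[int, str], from_unicode: Dict[str, int]
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split the Unicode-to-byte map into round-trip and one-way entries."""
    round_trip = {}
    one_way = {}
    for char, byte in from_unicode.items():
        if to_unicode.get(byte) == char:
            # Encoding then decoding gives the same character back.
            round_trip[char] = byte
        else:
            # "Best fit" only: the byte decodes to some other character.
            one_way[char] = byte
    return round_trip, one_way


# Toy data (assumption): 'A' round-trips, FULLWIDTH LATIN CAPITAL LETTER A
# is folded onto the same byte in the encode direction only.
TO_UNICODE = {0x41: "A"}
FROM_UNICODE = {"A": 0x41, "\uff21": 0x41}

round_trip, one_way = split_round_trip(TO_UNICODE, FROM_UNICODE)
```

Under this reading, `round_trip` is the part I was treating as normative, and `one_way` is the part that is "best fit" proper.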
0x81 and 0x8d for cp1252, etc.
------------------------------
Reference:
http://haible.de/bruno/charsets/conversion-tables/CP1252.html
> See the tables provided by Microsoft:
> https://msdn.microsoft.com/en-us/library/cc195054.aspx
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
> As you can see, 0x81 and 0x8D are not mapped.
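
For anyone who wants to check this mechanically, here is a rough sketch that lists the unmapped bytes in a table of this general shape. The `SAMPLE` text is a hand-typed excerpt in the style of the unicode.org file (byte, code point, `#` comment), and its exact whitespace is an assumption; `parse_table` is a hypothetical helper.

```python
SAMPLE = """\
0x80\t0x20AC\t#EURO SIGN
0x81\t      \t#UNDEFINED
0x82\t0x201A\t#SINGLE LOW-9 QUOTATION MARK
0x8D\t      \t#UNDEFINED
"""


def parse_table(text):
    """Return (mapped, unmapped) for a single-byte mapping table.

    mapped is a dict of byte value to character; unmapped is a list of
    byte values whose code point column is empty.
    """
    mapped = {}
    unmapped = []
    for line in text.splitlines():
        # Drop the trailing comment, then split the remaining columns.
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue
        byte = int(fields[0], 16)
        if len(fields) >= 2:
            mapped[byte] = chr(int(fields[1], 16))
        else:
            unmapped.append(byte)
    return mapped, unmapped


mapped, unmapped = parse_table(SAMPLE)
```

Running the same function over a downloaded copy of the full table should list every byte that the table leaves undefined.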
This actually brings up why I keep going straight to the round-trip
subset of these "best fit" mappings...
MS used to serve its cp950 mapping at their "go global" site [1], which
now redirects to a "not found" page at [2]. This page points you to the
best fit mappings that, in turn, look like what I have on current
versions of Windows. As a result, I actually thought these "WINDOWS"
mappings were somehow obsolete.
[1]: https://web.archive.org/web/20110807111716/http://msdn.microsoft.com/en-us/goglobal/cc305155
[2]: https://msdn.microsoft.com/en-us/globalization/mt767590
Yes the Windows converter does it differently...
>
> In summary, choosing the right conversion table is a tricky choice. Don't
> think that what a converter on Windows does is always the right or best option!
I still expect Windows Code Pages to be defined by Windows itself...
--
Regards,
Arthur2e5