From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] cp936, cp950, cp1252, etc. does not behave like their windows counterparts
Date: Thu, 24 Nov 2016 21:37:30 +0100
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )
Hi,
Mingye Wang wrote:
> `iconv-version' returns iconv (Ubuntu GLIBC 2.23-0ubuntu4) 2.23.
There are two implementations of 'iconv' in GNU, one in glibc, and one
in libiconv. Here you are writing about glibc behaviour, for which you
can report bugs in the glibc bugzilla. But I can give you some background
anyway.
> It seems to me that the implementation of a few Windows code pages in
> libiconv does not behave like their Windows counterparts.
Yes. http://haible.de/bruno/charsets/conversion-tables/index.html
shows the differences between different converters for each encoding,
and you see differences between different Windows versions, between
glibc and libiconv and many other converters.
You can see from http://haible.de/bruno/charsets/conversion-tables/sources.html
that the site considers Windows versions up to October 2016.
> Missing euro sign in cp936
> --------------------------
>
> The single-byte euro sign at 0x80 might be the most well-known
> modification that Microsoft has done to GBK. But well, it's not present
> in libiconv:
>
> $ iconv -f cp936 -t utf-8 <<< $'\x80'
> iconv: illegal input sequence at position 0
> $ iconv -t cp936 -f utf-8 <<< $'\u20ac' | hexdump -C
> iconv: illegal input sequence at position 0
Reference:
http://haible.de/bruno/charsets/conversion-tables/GB2312.html
libiconv includes the '0x80 U+20AC' mapping for CP936; glibc doesn't.
Maybe because this Euro sign is not contained in the GBK 1.0 standard
(see https://en.wikipedia.org/wiki/GBK#GBK_1.0). Maybe because U+20AC
is mapped to a different codepoint in GB18030 and GB18030 is meant to
be an extension of GBK.
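For comparison, here is a small check against a third converter, Python's built-in codecs. It is only an illustration: Python's `cp936` is an alias for its own `gbk` codec and is neither the glibc nor the libiconv table, and the helper names `try_decode` and `try_encode` are made up for this sketch.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def try_encode(text: str, encoding: str):
    """Return the encoded bytes, or None if the codec rejects the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Compare how the GBK-style codec and GB18030 treat byte 0x80 and U+20AC.
for enc in ("cp936", "gb18030"):
    print(enc, "| 0x80 ->", try_decode(b"\x80", enc),
          "| U+20AC ->", try_encode("\u20ac", enc))
```

If the GB18030 line shows a multi-byte encoding for U+20AC rather than 0x80, that is the conflict mentioned above: a converter that adds '0x80 U+20AC' to CP936 maps the Euro sign to a different byte sequence than GB18030 does.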
> cp950 has no mappings for HKSCS
> -------------------------------
>
> Microsoft have released a some updates to code page 950 so it includes
> HKSCS. Among these updates is the well-known "cp951" hack.[1]
> [1]: https://blogs.msdn.microsoft.com/shawnste/2007/03/12/cp-951-hkscs/
>
> But well, iconv's cp950 does not even know the first Big5-EUDC
> character[2] in HKSCS:
>
> $ iconv -f cp950 -t utf-8 <<< $'\x87\x40'
> iconv: illegal input sequence at position 0
>
> [2]:
> http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt
>
> The reverse does not work either:
>
> $ iconv -t cp936 -f utf-8 <<< $'\u43f0' | hexdump -C
> iconv: illegal input sequence at position 0
> $ iconv -t cp936 -f utf-8 <<< $'\uf266' | hexdump -C
> iconv: illegal input sequence at position 0
>
> ... where the latter is one of these sequential PUA assignments for
> Big5-EUDC seen in MS's best-fit chart.[3] Since it's a bidirectional
> conversion, this assignment is not part of "best fit" behavior per [4].
> [3]:
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
> [4]:
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
Reference:
http://haible.de/bruno/charsets/conversion-tables/Big5.html
It is not a good idea to propagate arbitrary modifications of existing
encodings, because it causes interoperability problems. You are actually
calling it a "hack".
As you can see from http://haible.de/bruno/charsets/conversion-tables/Big5.html
(search for windows-2016/CP950.TXT), on Windows 10, CP950 does *not* contain
HKSCS extension mappings.
Likewise for the official mapping tables provided by Microsoft:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
The problems of attempting to unify CP950 with HKSCS extensions are being
discussed in https://github.com/whatwg/encoding/issues/75.
> Since it's a bidirectional
> conversion, this assignment is not part of "best fit" behavior per [4].
You're mistaken. The point of Microsoft's "best fit" tables is that they
also document the conversions that go in only one direction, i.e. those that
don't round-trip.
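The distinction can be sketched in a few lines of Python. The function name `split_reverse_mappings` and the table entries are hypothetical, chosen only to illustrate the idea; they are not copied from bestfit950.txt or any other Microsoft file.

```python
def split_reverse_mappings(to_unicode, from_unicode):
    """Split Unicode->byte entries into round-trip and one-way sets.

    An entry round-trips when decoding the byte gives back the same
    character; otherwise it is a one-way ("best fit") conversion.
    """
    round_trip, one_way = {}, {}
    for char, byte in from_unicode.items():
        if to_unicode.get(byte) == char:
            round_trip[char] = byte
        else:
            one_way[char] = byte
    return round_trip, one_way


# Illustrative toy tables (assumption, not real vendor data).
to_unicode = {0x41: "A", 0x80: "\u20ac"}
from_unicode = {
    "A": 0x41,
    "\u20ac": 0x80,
    "\uff21": 0x41,  # FULLWIDTH LATIN CAPITAL LETTER A folded onto 'A'
}

round_trip, one_way = split_reverse_mappings(to_unicode, from_unicode)
print("round-trip:", round_trip)
print("one-way:   ", one_way)
```

A best-fit file contains both kinds of entries, so the presence of a mapping in it says nothing by itself about whether that mapping round-trips.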
> 0x81 and 0x8d for cp1252, etc.
> ------------------------------
>
> Windows' single-byte code pages map like latin-1 (with C0 and C1)
> bidirectionally if no other values are defined for these bytes. libiconv
> does not display this behavior for cp1250, cp1252, etc.
>
> $ iconv -f cp1252 -t utf-8 <<< $'\x81'
> iconv: illegal input sequence at position 0
> $ iconv -f cp1252 -t utf-8 <<< $'\x8d'
> iconv: illegal input sequence at position 0
> $ iconv -f cp1250 -t utf-8 <<< $'\x81'
> iconv: illegal input sequence at position 0
> $ iconv -t cp1252 -f utf-8 <<< $'\u0081' | hexdump -C
> iconv: illegal input sequence at position 0
Reference:
http://haible.de/bruno/charsets/conversion-tables/CP1252.html
See the tables provided by Microsoft:
https://msdn.microsoft.com/en-us/library/cc195054.aspx
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
As you can see, 0x81 and 0x8D are not mapped.
Yes, the Windows converter does it differently...
In summary, choosing the right conversion table is tricky. Don't think that
what a converter on Windows does is always the right or best option!
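The two behaviours can be put side by side in Python, whose built-in `cp1252` codec also leaves 0x81 and 0x8D unmapped. The fallback function below is a hypothetical sketch of the Latin-1-style pass-through described in the quoted report; it is an assumption about how the Windows converter behaves, not code taken from it, and the name `decode_cp1252_with_c1_fallback` is invented here.

```python
def decode_cp1252_with_c1_fallback(data: bytes) -> str:
    """Decode CP1252, passing unmapped bytes through as C1 controls.

    Assumption: bytes with no entry in the published table map to the
    Unicode code point of the same value (U+0081 for 0x81, and so on).
    """
    chars = []
    for value in data:
        try:
            chars.append(bytes([value]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(value))  # identity mapping, as in Latin-1
    return "".join(chars)


# Strict decoding follows the published table and rejects 0x81.
try:
    b"\x81".decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print("strict decode of 0x81 succeeds:", strict_ok)
print("fallback decode:", repr(decode_cp1252_with_c1_fallback(b"\x80A\x81")))
```

Which of the two is "right" depends on whether you want to follow the documented table or reproduce what a particular converter does.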
Bruno