From: Mingye Wang (Arthur2e5)
Subject: [bug-gnu-libiconv] cp936, cp950, cp1252, etc. does not behave like their windows counterparts
Date: Wed, 23 Nov 2016 21:03:56 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.7.2
Hi,
It seems to me that the implementation of a few Windows code pages in
libiconv does not behave like their Windows counterparts.
For clarity, I am using bash's $'ansi-c-escape' literals, with my
console set to UTF-8. `iconv --version' returns iconv (Ubuntu GLIBC
2.23-0ubuntu4) 2.23.
Missing euro sign in cp936
--------------------------
The single-byte euro sign at 0x80 might be the most well-known
modification that Microsoft has made to GBK. But well, it's not present
in libiconv:
$ iconv -f cp936 -t utf-8 <<< $'\x80'
iconv: illegal input sequence at position 0
$ iconv -t cp936 -f utf-8 <<< $'\u20ac' | hexdump -C
iconv: illegal input sequence at position 0
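The expected mapping can be sketched in Python. This is only an
illustration, not libiconv code: it assumes Python's built-in "gbk"
codec as the base table, and the function names are made up for this
example. The only addition is the single byte 0x80 <-> U+20AC. Note that
0x80 is also a valid GBK trail byte, so the decoder has to walk the
input rather than blindly substitute.

```python
EURO = "\u20ac"


def decode_cp936_like(data: bytes) -> str:
    """Decode GBK bytes, additionally treating a lone 0x80 as the euro sign."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x80:
            # Microsoft's single-byte extension (assumed: 0x80 -> U+20AC).
            out.append(EURO)
            i += 1
        elif b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            # Lead byte of a two-byte sequence; the stock codec validates
            # the pair and raises UnicodeDecodeError on invalid input.
            out.append(data[i:i + 2].decode("gbk"))
            i += 2
    return "".join(out)


def encode_cp936_like(text: str) -> bytes:
    """Encode to GBK, emitting the single byte 0x80 for the euro sign."""
    out = bytearray()
    for ch in text:
        out += b"\x80" if ch == EURO else ch.encode("gbk")
    return bytes(out)
```

With these helpers, decode_cp936_like(b"\x80") gives the euro sign and
encode_cp936_like("\u20ac") gives the single byte 0x80, which is the
round trip the two commands above fail to perform.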
cp950 has no mappings for HKSCS
-------------------------------
Microsoft has released some updates to code page 950 so that it includes
HKSCS. Among these updates is the well-known "cp951" hack.[1]
[1]: https://blogs.msdn.microsoft.com/shawnste/2007/03/12/cp-951-hkscs/
But well, iconv's cp950 does not even know the first Big5-EUDC
character[2] in HKSCS:
$ iconv -f cp950 -t utf-8 <<< $'\x87\x40'
iconv: illegal input sequence at position 0
[2]: http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/hkscs-2008-big5-iso.txt
The reverse does not work either:
$ iconv -t cp950 -f utf-8 <<< $'\u43f0' | hexdump -C
iconv: illegal input sequence at position 0
$ iconv -t cp950 -f utf-8 <<< $'\uf266' | hexdump -C
iconv: illegal input sequence at position 0
... where the latter is one of the sequential PUA assignments for
Big5-EUDC seen in MS's best-fit chart.[3] Since it is a bidirectional
conversion, this assignment is not part of "best fit" behavior per [4].
[3]: ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt
[4]: ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt
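The "sequential" nature of these PUA assignments can be sketched as a
small formula in Python. Everything here is an assumption to be checked
against [3]: the base code point U+EEB8 for 0x8140, the lead-byte range
0x81-0x8D, and the function names, which are made up for this example.
The sketch covers only that one range. It does yield U+F266 for 0x8740,
which is consistent with the code point used above.

```python
# Trail bytes per lead byte in Big5: 0x40-0x7E (63) plus 0xA1-0xFE (94).
ROW_SIZE = 157

# Assumed PUA code point for 0x8140; verify against bestfit950.txt.
PUA_BASE = 0xEEB8


def trail_index(trail: int) -> int:
    """Position of a Big5 trail byte within its row (0..156)."""
    if 0x40 <= trail <= 0x7E:
        return trail - 0x40
    if 0xA1 <= trail <= 0xFE:
        return trail - 0xA1 + 63
    raise ValueError(f"invalid Big5 trail byte: {trail:#04x}")


def eudc_to_pua(lead: int, trail: int) -> int:
    """Map a Big5-EUDC byte pair (lead 0x81-0x8D) to its assumed PUA code point."""
    if not 0x81 <= lead <= 0x8D:
        raise ValueError(f"lead byte outside the sketched range: {lead:#04x}")
    return PUA_BASE + (lead - 0x81) * ROW_SIZE + trail_index(trail)


print(f"U+{eudc_to_pua(0x87, 0x40):04X}")  # U+F266
```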
0x81 and 0x8d for cp1252, etc.
------------------------------
Windows' single-byte code pages map any byte that has no other
assignment the way latin-1 does (including C0 and C1), in both
directions. libiconv does not exhibit this behavior for cp1250, cp1252,
etc.
$ iconv -f cp1252 -t utf-8 <<< $'\x81'
iconv: illegal input sequence at position 0
$ iconv -f cp1252 -t utf-8 <<< $'\x8d'
iconv: illegal input sequence at position 0
$ iconv -f cp1250 -t utf-8 <<< $'\x81'
iconv: illegal input sequence at position 0
$ iconv -t cp1252 -f utf-8 <<< $'\u0081' | hexdump -C
iconv: illegal input sequence at position 0
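The fallback described in this section can be sketched in Python. This
is an illustration only, not how Windows or libiconv implement it: it
assumes Python's built-in "cp1252" and "cp1250" codecs, which also leave
these bytes undefined, and the function names are made up for this
example. It is valid only for single-byte code pages.

```python
def _unassigned(byte: int, codepage: str) -> bool:
    """True if the stock codec has no mapping for this byte."""
    try:
        bytes([byte]).decode(codepage)
    except UnicodeDecodeError:
        return True
    return False


def decode_windows_like(data: bytes, codepage: str = "cp1252") -> str:
    """Decode byte by byte; unassigned bytes map to the same code point (latin-1)."""
    out = []
    for b in data:
        if _unassigned(b, codepage):
            out.append(chr(b))
        else:
            out.append(bytes([b]).decode(codepage))
    return "".join(out)


def encode_windows_like(text: str, codepage: str = "cp1252") -> bytes:
    """Encode char by char; fall back to latin-1 only for unassigned bytes."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(codepage)
        except UnicodeEncodeError:
            # Only fall back when the byte is free in the code page, so
            # that the mapping stays bidirectional.
            if ord(ch) <= 0xFF and _unassigned(ord(ch), codepage):
                out.append(ord(ch))
            else:
                raise
    return bytes(out)
```

Under these assumptions, decode_windows_like(b"\x81") returns U+0081 and
encode_windows_like("\u0081") returns the byte 0x81, while U+0080 is
still rejected for cp1252 because byte 0x80 is already taken by the euro
sign.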
--
Regards,
Arthur2e5