Re: [bug-gnu-libiconv] Possible CP932 conversions bug
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Possible CP932 conversions bug
Date: Tue, 13 Dec 2016 17:58:14 +0100
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )

Hello Maxim,
> While testing codepage conversions, I came across the following discrepancy:
> when converting from CP932 to UTF-16 certain characters get converted into
> different unicode on Linux (using iconv) and Mac (using libiconv). Looking at
> some CP932 to Unicode tables online it appears that the Linux conversions are
> consistent with those tables, while the libiconv uses visually similar
> characters, but with different codes from the ones found in the
> aforementioned tables.
>
> As far as I can tell the issue happens with following characters:
>
> CP932 0x8160 -> Output [0x301C], Expected [0xFF5E] // Wavy dash
> CP932 0x8161 -> Output [0x2016], Expected [0x2225] // Vertical double line
> CP932 0x817C -> Output [0x2212], Expected [0xFF0D] // A dash
> CP932 0x8191 -> Output [0x00A2], Expected [0xFFE0] // Cent sign
> CP932 0x8192 -> Output [0x00A3], Expected [0xFFE1] // Pound sign
> CP932 0x81CA -> Output [0x00AC], Expected [0xFFE2] // Logical "not" sign
I confirm: These differences exist. With the tools from [1] and the tables
from [2], I get
$ ./table-diff glibc-2.23-iconv/CP932.TXT libiconv-1.14/CP932.TXT
***************
*** 160,163 ****
  0x815F 0xFF3C # FULLWIDTH REVERSE SOLIDUS
! 0x8160 0xFF5E # FULLWIDTH TILDE
! 0x8161 0x2225 # PARALLEL TO
  0x8162 0xFF5C # FULLWIDTH VERTICAL LINE
--- 160,163 ----
  0x815F 0xFF3C # FULLWIDTH REVERSE SOLIDUS
! 0x8160 0x301C # WAVE DASH
! 0x8161 0x2016 # DOUBLE VERTICAL LINE
  0x8162 0xFF5C # FULLWIDTH VERTICAL LINE
***************
*** 188,190 ****
  0x817B 0xFF0B # FULLWIDTH PLUS SIGN
! 0x817C 0xFF0D # FULLWIDTH HYPHEN-MINUS
  0x817D 0x00B1 # PLUS-MINUS SIGN
--- 188,190 ----
  0x817B 0xFF0B # FULLWIDTH PLUS SIGN
! 0x817C 0x2212 # MINUS SIGN
  0x817D 0x00B1 # PLUS-MINUS SIGN
***************
*** 208,211 ****
  0x8190 0xFF04 # FULLWIDTH DOLLAR SIGN
! 0x8191 0xFFE0 # FULLWIDTH CENT SIGN
! 0x8192 0xFFE1 # FULLWIDTH POUND SIGN
  0x8193 0xFF05 # FULLWIDTH PERCENT SIGN
--- 208,211 ----
  0x8190 0xFF04 # FULLWIDTH DOLLAR SIGN
! 0x8191 0x00A2 # CENT SIGN
! 0x8192 0x00A3 # POUND SIGN
  0x8193 0xFF05 # FULLWIDTH PERCENT SIGN
***************
*** 246,248 ****
  0x81C9 0x2228 # LOGICAL OR
! 0x81CA 0xFFE2 # FULLWIDTH NOT SIGN
  0x81CB 0x21D2 # RIGHTWARDS DOUBLE ARROW
--- 246,248 ----
  0x81C9 0x2228 # LOGICAL OR
! 0x81CA 0x00AC # NOT SIGN
  0x81CB 0x21D2 # RIGHTWARDS DOUBLE ARROW
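
For readers without the tools from [1], the same kind of comparison can be approximated in a few lines of Python. This is only a sketch, not the table-diff tool itself: the names `parse_table` and `diff_tables` are invented for illustration, and the two excerpts contain just the lines visible in the output above rather than the full CP932.TXT files.

```python
import re

# A table line looks like: "0x8160 0xFF5E # FULLWIDTH TILDE"
LINE_RE = re.compile(r"^(0x[0-9A-Fa-f]+)\s+(0x[0-9A-Fa-f]+)")

# Excerpt of the glibc-style table (only the lines shown in the diff).
GLIBC_EXCERPT = """\
0x815F 0xFF3C # FULLWIDTH REVERSE SOLIDUS
0x8160 0xFF5E # FULLWIDTH TILDE
0x8161 0x2225 # PARALLEL TO
0x8162 0xFF5C # FULLWIDTH VERTICAL LINE
0x817C 0xFF0D # FULLWIDTH HYPHEN-MINUS
0x8191 0xFFE0 # FULLWIDTH CENT SIGN
0x8192 0xFFE1 # FULLWIDTH POUND SIGN
0x81CA 0xFFE2 # FULLWIDTH NOT SIGN
"""

# Excerpt of the libiconv-style table (only the lines shown in the diff).
LIBICONV_EXCERPT = """\
0x815F 0xFF3C # FULLWIDTH REVERSE SOLIDUS
0x8160 0x301C # WAVE DASH
0x8161 0x2016 # DOUBLE VERTICAL LINE
0x8162 0xFF5C # FULLWIDTH VERTICAL LINE
0x817C 0x2212 # MINUS SIGN
0x8191 0x00A2 # CENT SIGN
0x8192 0x00A3 # POUND SIGN
0x81CA 0x00AC # NOT SIGN
"""


def parse_table(text):
    """Parse 'byte-code unicode-code # comment' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            mapping[int(match.group(1), 16)] = int(match.group(2), 16)
    return mapping


def diff_tables(a, b):
    """Return (code, a_value, b_value) for codes mapped differently."""
    shared = sorted(a.keys() & b.keys())
    return [(code, a[code], b[code]) for code in shared if a[code] != b[code]]


if __name__ == "__main__":
    glibc = parse_table(GLIBC_EXCERPT)
    libiconv = parse_table(LIBICONV_EXCERPT)
    for code, left, right in diff_tables(glibc, libiconv):
        print(f"0x{code:04X}: glibc U+{left:04X}, libiconv U+{right:04X}")
```

Run against the complete tables from [2], the same two functions should list every byte sequence on which the two variants disagree, though without the surrounding context lines that a context diff provides.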
It seems that the glibc variant is more closely based on the tables
published by Microsoft
  unicode.org-mappings/VENDORS/MICSFT/WINDOWS/CP932.TXT
  microsoft-2005/CP932.TXT
whereas the libiconv variant is more closely based on the JIS X 0208 standard
  unicode.org-mappings/EASTASIA/JIS/SHIFTJIS.TXT
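
The same split between a Microsoft-based and a JIS-based mapping can be observed outside of iconv. As an illustration only, and assuming a standard CPython installation, Python ships two separate codecs, `cp932` and `shift_jis`, which decode the six byte sequences from the report differently. The helper name `compare_codecs` is made up for this sketch.

```python
# The six CP932 byte sequences listed in the original report.
SAMPLES = [
    b"\x81\x60",
    b"\x81\x61",
    b"\x81\x7c",
    b"\x81\x91",
    b"\x81\x92",
    b"\x81\xca",
]


def compare_codecs(samples=SAMPLES):
    """Decode each sample with both codecs and report the code points."""
    rows = []
    for raw in samples:
        microsoft_style = raw.decode("cp932")   # Microsoft-based mapping
        jis_style = raw.decode("shift_jis")     # JIS X 0208-based mapping
        rows.append((
            raw.hex().upper(),
            f"U+{ord(microsoft_style):04X}",
            f"U+{ord(jis_style):04X}",
        ))
    return rows


if __name__ == "__main__":
    for code, ms, jis in compare_codecs():
        print(f"0x{code}: cp932 -> {ms}, shift_jis -> {jis}")
```

An application that needs one specific behaviour can thus pick the mapping by name instead of relying on whichever table a given iconv implementation uses for "CP932".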
It's hard to say which of the two is "better" today...
Bruno
[1] http://haible.de/bruno/charsets/conversion-tables/tools.html
[2] http://haible.de/bruno/charsets/conversion-tables/Shift_JIS.html