Re: [bug-gnu-libiconv] Big5-HKSCS
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Big5-HKSCS
Date: Thu, 25 Nov 2010 00:40:38 +0100
User-agent: KMail/1.9.9
oCameLo wrote:
> These mappings differ between CP950 and Big5-HKSCS. I'm not sure which
> one is correct, but CP950 is more likely, because 0xA244, 0xA246,
> 0xA247 shouldn't map to single-width characters. Also, \uFF0F and
> \uFF3C are very common characters, so Big5-HKSCS in libiconv might not
> be able to work with many CP951 files.
>
> Could you please tell me which variant of Big5 the Big5-HKSCS converter
> in libiconv is based on, and why not CP950?
>
> Thanks for your work, very much.
>
>
> Big5    CP950_TO_UCS  HKSCS_TO_HKSCS
> 0xA145  0x2027 [‧]    0x2022 [•]
> 0xA14E  0xFE51 [﹑]    0xFF64 [､]
> 0xA15A  0x2574 [╴]    nil
> 0xA1C2  0x00AF [¯]    0x203E [‾]
> 0xA1C3  0xFFE3 [￣]    nil
> 0xA1C5  0x02CD [ˍ]    nil
> 0xA1E3  0xFF5E [～]    0x223C [∼]
> 0xA1F2  0x2295 [⊕]    0x2641 [♁]
> 0xA1F3  0x2299 [⊙]    0x2609 [☉]
> 0xA1FE  0xFF0F [／]    nil
> 0xA240  0xFF3C [＼]    nil
> 0xA241  0x2215 [∕]    0xFF0F [／]
> 0xA242  0xFE68 [﹨]    0xFF3C [＼]
> 0xA244  0xFFE5 [￥]    0x00A5 [¥]
> 0xA246  0xFFE0 [￠]    0x00A2 [¢]
> 0xA247  0xFFE1 [￡]    0x00A3 [£]
> 0xA2CC  0x5341 [十]    nil
> 0xA2CE  0x5345 [卅]    nil
> 0xA3E1  0x20AC [€]    nil
> 0xF9FE  0x2593 [▓]    0xFFED [￭]
The conversion table in libiconv is based on the Big5 conversion table that
was found on ftp.unicode.org in 1999 / 2000.
You're saying "CP950 is more likely", but the justification you give is very
weak. There are many variants of Big5; see
http://www.haible.de/bruno/charsets/conversion-tables/Big5.html
http://www.haible.de/bruno/charsets/conversion-tables/BIG5-HKSCS.html
Also, where did you get your column "HKSCS_TO_HKSCS" from?
The file e_hkscs_2008.pdf that can be downloaded from
http://www.ogcio.gov.hk/ccli/eng/hkscs/document.html
does not explicitly state which version of Big5 is meant to be the base.
The only indication I can find is the table in section 3.4, which gives
the expected number of characters in three blocks. When I compare this
with the character counts in the various libiconv mapping tables, I get this:
Range A140..A3BF, expect 408 characters.
$ LC_ALL=C grep -c '^0x\(A[1-2]\|A3[4-B]\)' BIG*.TXT CP950.TXT | grep -v :0'$'
BIG5-2003.TXT:408
BIG5-HKSCS-1999.TXT:401
BIG5-HKSCS-2001.TXT:401
BIG5-HKSCS-2004.TXT:401
BIG5-HKSCS-2008.TXT:401
BIG5.TXT:401
CP950.TXT:408
Range A440..C67E, expect 5401 characters.
$ LC_ALL=C grep -c '^0x\(A[4-F]\|B\|C[0-5]\|C6[4-7]\)' BIG*.TXT CP950.TXT | grep -v :0'$'
BIG5-2003.TXT:5401
BIG5-HKSCS-1999.TXT:5401
BIG5-HKSCS-2001.TXT:5401
BIG5-HKSCS-2004.TXT:5401
BIG5-HKSCS-2008.TXT:5401
BIG5.TXT:5401
CP950.TXT:5401
Range C940..F9D5, expect 7652 characters.
$ LC_ALL=C grep -c '^0x\(C[9-F]\|[DE]\|F[0-8]\|F9[4-C]\|F9D[0-5]\)' BIG*.TXT CP950.TXT | grep -v :.'$'
BIG5-2003.TXT:7652
BIG5-HKSCS-1999.TXT:7652
BIG5-HKSCS-2001.TXT:7652
BIG5-HKSCS-2004.TXT:7652
BIG5-HKSCS-2008.TXT:7652
BIG5.TXT:7652
CP950.TXT:7652
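For anyone who wants to cross-check these counts without the regular expressions, here is a minimal Python sketch. It assumes the tables use the "0xBIG5 0xUCS # comment" line layout and contain only valid Big5 trail bytes, so that a plain numeric range test selects the same lines as the grep patterns above. The function name count_in_range and the sample data are made up for illustration; this is not how the counts above were produced.

```python
import io

def count_in_range(table, lo, hi):
    """Count mapping lines whose Big5 code (first field) lies in [lo, hi]."""
    count = 0
    for line in table:
        # Drop the trailing "# comment" part, then split into fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0].lower().startswith("0x"):
            if lo <= int(fields[0], 16) <= hi:
                count += 1
    return count

# Tiny made-up sample in the assumed layout (not a real mapping table).
sample = io.StringIO(
    "# comment line\n"
    "0xA140 0x3000 # IDEOGRAPHIC SPACE\n"
    "0xA3BF 0x02CB\n"
    "0xA440 0x4E00\n"
)
print(count_in_range(sample, 0xA140, 0xA3BF))
```

With a real table one would pass an open file instead of the sample, for example count_in_range(open("CP950.TXT"), 0xA140, 0xA3BF).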
Judging by the first block, only CP950 and BIG5-2003 match the expected count
of 408, so they are the most likely candidates. But these two differ from each
other as well:
$ ./table-diff /tmp/CP950.TXT /tmp/BIG5-2003.TXT
***************
*** 22,24 ****
0xA155 0xFF5C # FULLWIDTH VERTICAL LINE
! 0xA156 0x2013 # EN DASH
0xA157 0xFE31 # PRESENTATION FORM FOR VERTICAL EM DASH
--- 22,24 ----
0xA155 0xFF5C # FULLWIDTH VERTICAL LINE
! 0xA156 0x2015 # HORIZONTAL BAR
0xA157 0xFE31 # PRESENTATION FORM FOR VERTICAL EM DASH
***************
*** 96,98 ****
0xA1C1 0x2105 # CARE OF
! 0xA1C2 0x00AF # MACRON
0xA1C3 0xFFE3 # FULLWIDTH MACRON
--- 96,98 ----
0xA1C1 0x2105 # CARE OF
! 0xA1C2 0x203E # OVERLINE
0xA1C3 0xFFE3 # FULLWIDTH MACRON
***************
*** 223,228 ****
0xA2A3 0x256F # BOX DRAWINGS LIGHT ARC UP AND LEFT
! 0xA2A4 0x2550 # BOX DRAWINGS DOUBLE HORIZONTAL
! 0xA2A5 0x255E # BOX DRAWINGS VERTICAL SINGLE AND RIGHT DOUBLE
! 0xA2A6 0x256A # BOX DRAWINGS VERTICAL SINGLE AND HORIZONTAL DOUBLE
! 0xA2A7 0x2561 # BOX DRAWINGS VERTICAL SINGLE AND LEFT DOUBLE
0xA2A8 0x25E2 # BLACK LOWER RIGHT TRIANGLE
--- 223,228 ----
0xA2A3 0x256F # BOX DRAWINGS LIGHT ARC UP AND LEFT
! 0xA2A4 0x2501 # BOX DRAWINGS HEAVY HORIZONTAL
! 0xA2A5 0x251D # BOX DRAWINGS VERTICAL LIGHT AND RIGHT HEAVY
! 0xA2A6 0x253F # BOX DRAWINGS VERTICAL LIGHT AND HORIZONTAL HEAVY
! 0xA2A7 0x2525 # BOX DRAWINGS VERTICAL LIGHT AND LEFT HEAVY
0xA2A8 0x25E2 # BLACK LOWER RIGHT TRIANGLE
***************
*** 263,267 ****
0xA2CB 0x3029 # HANGZHOU NUMERAL NINE
! 0xA2CC 0x5341 # <CJK Ideograph>
! 0xA2CD 0x5344 # <CJK Ideograph>
! 0xA2CE 0x5345 # <CJK Ideograph>
0xA2CF 0xFF21 # FULLWIDTH LATIN CAPITAL LETTER A
--- 263,267 ----
0xA2CB 0x3029 # HANGZHOU NUMERAL NINE
! 0xA2CC 0x3038 # HANGZHOU NUMERAL TEN
! 0xA2CD 0x3039 # HANGZHOU NUMERAL TWENTY
! 0xA2CE 0x303A # HANGZHOU NUMERAL THIRTY
0xA2CF 0xFF21 # FULLWIDTH LATIN CAPITAL LETTER A
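A comparison of this kind can also be sketched in a few lines of Python. This is only an illustrative stand-in, not the table-diff script used above. The names load_table, differing_codes and fmt are hypothetical, and the sketch assumes each line maps one Big5 code to a single Unicode code point. The two sample entries are taken from the first hunk of the diff.

```python
def load_table(lines):
    """Parse '0xBIG5 0xUCS # comment' lines into {big5_code: unicode_code}."""
    table = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            table[int(fields[0], 16)] = int(fields[1], 16)
    return table

def differing_codes(a, b):
    """Return sorted (code, a_value, b_value) tuples where the mappings differ.

    A value of None means the code is unmapped in that table.
    """
    codes = sorted(a.keys() | b.keys())
    return [(c, a.get(c), b.get(c)) for c in codes if a.get(c) != b.get(c)]

def fmt(u):
    """Format a code point, or 'nil' when the code is unmapped."""
    return "nil" if u is None else f"0x{u:04X}"

# Two entries copied from the diff; real use would read the .TXT files.
cp950 = load_table(["0xA155 0xFF5C", "0xA156 0x2013 # EN DASH"])
big5_2003 = load_table(["0xA155 0xFF5C", "0xA156 0x2015 # HORIZONTAL BAR"])

for code, u1, u2 in differing_codes(cp950, big5_2003):
    print(f"0x{code:04X}: {fmt(u1)} vs {fmt(u2)}")
```

Unlike a context diff, this reports only the differing codes, without the surrounding unchanged lines.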
So, really, it's ambiguous.
I won't make a backward incompatible change to libiconv and glibc until there
is _clear_ evidence which variant of BIG5 is meant.
Bruno