From: Link
Subject: Re: [bug-gnu-libiconv] gb18030 for 0x215d7
Date: Mon, 29 May 2023 08:31:49 +0800
> > A bijective 1-1 conversion table does not provide the best user
> > experience in this situation.
> I don't insist on 1-1 conversion any more since the one in PUA should
> retire some day.
Good. Now we are in the same boat.
> Figured out a little history:
> GB18030/2000 (up to Ext-A): 0xFE6C -> U+E831 (PUA)
> Character adopted by Unicode U+215D7 (Ext-B)
> GB18030/2005 adopted Ext-B: 0x9536B937 -> U+215D7
Yep. Based on the second printing of GB18030-2000 (the first printing had
multiple mistakes), I had noted:
* p. 81 0xFE51..0xFE53 U+E816..U+E818
* p. 81 0xFE59 U+E81E
* p. 81 0xFE61 U+E826
* p. 81 0xFE66..0xFE67 U+E82B..U+E82C
* p. 81 0xFE6C..0xFE6D U+E831..U+E832
* p. 81 0xFE76 U+E83B
* p. 81 0xFE7E U+E843
* p. 81 0xFE90..0xFE91 U+E854..U+E855
* p. 81 0xFEA0 U+E864
So, it mapped 0xFE6C -> U+E831.
U+215D7 was added in Unicode 3.1 (2001).
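
For reference, the four-byte code 0x9536B937 is not an arbitrary table entry. GB18030 assigns the supplementary planes linearly, starting with U+10000 at 0x90308130, so the code can be recomputed. A minimal sketch of that arithmetic (the function name is made up for illustration and is not part of libiconv):

```python
def gb18030_supplementary(code_point):
    """Four-byte GB18030 code for a code point in U+10000..U+10FFFF.

    Assumes the linear rule for supplementary planes: U+10000 maps to
    0x90308130, and the bytes count up in a mixed radix
    (second and fourth byte: 10 values, third byte: 126 values).
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    index = code_point - 0x10000
    index, b4 = divmod(index, 10)    # fourth byte: 0x30..0x39
    index, b3 = divmod(index, 126)   # third byte:  0x81..0xFE
    b1, b2 = divmod(index, 10)       # second byte: 0x30..0x39
    return bytes([0x90 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])
```

Calling `gb18030_supplementary(0x215D7)` yields the bytes 95 36 B9 37.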
The mapping table for GB18030-2005 in GNU libiconv is designed to ease the
transition from the PUA code point to the U+215D7 code point:
0xFE6C -> U+215D7
0x9536B937 -> U+215D7
In the other direction libiconv does this:
U+215D7 -> 0xFE6C
U+E831 -> 0xFE6C
Or would this be better?
U+215D7 -> 0x9536B937
U+E831 -> 0x9536B937
For comparison [1]:
* glibc (up to version 2.35 at least), which implements GB18030-2005, maps
U+215D7 -> 0xFE6C
U+E831 -> 0xFE6C
* JDK 5 maps
U+215D7 -> 0x9536B937
U+E831 -> 0xFE6C
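
To make the differences concrete, here is a small sketch that replays only the mappings quoted above. It uses plain dictionaries rather than any real converter, and the table names are illustrative labels, not identifiers from libiconv, glibc, or the JDK. Every round trip decodes with libiconv's GB18030-2005 table, since the decoding direction of the other implementations is not stated here.

```python
FE6C = bytes.fromhex("FE6C")
FOUR_BYTE = bytes.fromhex("9536B937")

# Decoding direction of libiconv's GB18030-2005 converter, as quoted above.
DECODE = {FE6C: 0x215D7, FOUR_BYTE: 0x215D7}

# Encoding direction: the four variants listed above.
ENCODE = {
    "libiconv_current": {0x215D7: FE6C, 0xE831: FE6C},
    "libiconv_alternative": {0x215D7: FOUR_BYTE, 0xE831: FOUR_BYTE},
    "glibc": {0x215D7: FE6C, 0xE831: FE6C},
    "jdk5": {0x215D7: FOUR_BYTE, 0xE831: FE6C},
}

def round_trip(gb_bytes, encode_table):
    """Decode with DECODE, then encode again with the given table."""
    return encode_table[DECODE[gb_bytes]]

def summarize():
    """For each variant, map each input byte sequence to its round-trip result."""
    return {
        name: {src.hex().upper(): round_trip(src, table).hex().upper()
               for src in DECODE}
        for name, table in ENCODE.items()
    }
```

With both of libiconv's tables, the two byte sequences collapse to a single one after a round trip. The tables differ only in which sequence survives.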
> The real question here:
> U+215D7 -> GB18030: 0xFE6C or 0x9536B937?
> I think 0x9536B937 is the better choice, because Ext-B characters in
> GB18030 are all coded in 4 bytes.
That would still be quite arbitrary.
I think, before 2010, when one could not assume that the GB18030 fonts
had a glyph for 0x9536B937, it was probably best to map
U+215D7 -> 0xFE6C
U+E831 -> 0xFE6C
Meanwhile, however, the GB18030-2022 standard has been released, and it
effectively retires the PUA mappings (making them optional in the fonts). [2]
It maps
0xFE6C <-> U+E831
0x9536B937 <-> U+215D7
So, the underlying assumptions are that
- the fonts now all have glyphs for 0x9536B937,
- uses of these characters in files should have migrated (or should migrate?)
to 0x9536B937.
Thus libiconv's GB18030-2005 converter should now map
U+215D7 -> 0x9536B937
U+E831 -> 0x9536B937
for a seamless transition.
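
A rough way to see why this helps the transition: encode with a GB18030-2005 table, then read the bytes back under the GB18030-2022 mappings quoted above. This is only a sketch over the two codes discussed in this thread, with made-up table names, not a test against a real converter.

```python
FE6C = bytes.fromhex("FE6C")
FOUR_BYTE = bytes.fromhex("9536B937")

# GB18030-2022 mappings for the two codes, as quoted above.
DECODE_2022 = {FE6C: 0xE831, FOUR_BYTE: 0x215D7}

# Encoding tables for a GB18030-2005 converter: old and proposed behaviour.
ENCODE_2005_OLD = {0x215D7: FE6C, 0xE831: FE6C}
ENCODE_2005_PROPOSED = {0x215D7: FOUR_BYTE, 0xE831: FOUR_BYTE}

def migrate(code_point, encode_2005):
    """Encode with a 2005 table, then decode the bytes under GB18030-2022."""
    return DECODE_2022[encode_2005[code_point]]
```

With `ENCODE_2005_PROPOSED`, both U+215D7 and U+E831 come back as U+215D7. With `ENCODE_2005_OLD`, both come back as the PUA code point U+E831.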
Bruno
[1] https://www.haible.de/bruno/charsets/conversion-tables/GB18030.html
[2] https://ken-lunde.medium.com/the-gb-18030-2022-standard-3d0ebaeb4132