From: Luis Javier Merino
Subject: Re: [bug-libunistring] Hangul Jamo vowels and trailing consonants should probably be 0 width
Date: Tue, 28 Dec 2021 13:40:09 +0100
On Tue, Dec 28, 2021 at 11:36 AM Bruno Haible <bruno@clisp.org> wrote:
> I agree that U+D7B0..U+D7FF (Hangul Jamo Extended-B) should be treated like
> U+1160..U+11FF (Hangul Jamo medial and final), per Unicode standard, chapter
> 18
> https://www.unicode.org/versions/Unicode14.0.0/ch18.pdf .
>
> However, I don't think what people have been looking at is the right spot.
Yes. wcwidth()-style interfaces see one character at a time and so
lack context. wcswidth()-style interfaces are better in that regard.
E.g., Perl's Unicode::GCString:
use strict;
use warnings;
binmode(STDOUT, ":utf8");
use Unicode::GCString;
use Text::CharWidth qw(mbwidth mbswidth);
sub string_info {
    my $s = shift;
    my $gc = Unicode::GCString->new($s);
    # Width of the whole string, with and without grapheme-cluster context.
    print "$s : GCString->columns: ", $gc->columns, " : mbswidth: ", mbswidth($s), "\n";
    # Width of each code point taken in isolation.
    for (my $i = 0; $i < length($s); $i++) {
        my $c = substr($s, $i, 1);
        my $cgc = Unicode::GCString->new($c);
        print "\t$c : GCString->columns: ", $cgc->columns, " : mbswidth: ", mbswidth($c), " : mbwidth: ", mbwidth($c), "\n";
    }
}
string_info("\x{1100}\x{d7b0}\x{d7fb}\x{1101}\x{d7c0}\x{d7c2}\x{d7d0}");
string_info("\x{1100}\x{200b}\x{d7b0}\x{200b}\x{d7fb}\x{1101}\x{200b}\x{d7c0}\x{200b}\x{d7c2}\x{200b}\x{d7d0}");
The above script results in:
ᄀힰퟻᄁퟀퟂퟐ : GCString->columns: 4 : mbswidth: 4
ᄀ : GCString->columns: 2 : mbswidth: 2 : mbwidth: 2
ힰ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ퟻ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ᄁ : GCString->columns: 2 : mbswidth: 2 : mbwidth: 2
ퟀ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ퟂ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ퟐ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ᄀힰퟻᄁퟀퟂퟐ : GCString->columns: 14 : mbswidth: 4
ᄀ : GCString->columns: 2 : mbswidth: 2 : mbwidth: 2
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ힰ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ퟻ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
ᄁ : GCString->columns: 2 : mbswidth: 2 : mbwidth: 2
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ퟀ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ퟂ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
: GCString->columns: 0 : mbswidth: 0 : mbwidth: 0
ퟐ : GCString->columns: 2 : mbswidth: 0 : mbwidth: 0
(In the line with GCString->columns: 14, the Jamos are separated by
U+200B and should be displayed as separate Jamos.)
> 2) People argue about the use of these Hangul Jamo characters when
> they form a complete Hangul syllable, and that in this case the
> total width should be 2, and therefore 2 = 2 + medial + final the
> medial and final parts should have width 0.
>
> But in this case people would be using a precomposed Hangul syllable.
The Mac OS X filesystem stores filenames as NFD, which would separate
syllables into component Jamos. See:
https://github.com/neovim/neovim/issues/4476
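To illustrate what NFD does to a precomposed syllable, here is a minimal
Python sketch using only the standard unicodedata module. The helper name
decompose() is made up for this example.

```python
import unicodedata

def decompose(s):
    """Return the code points of s after NFD normalization."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]

# U+D55C (HANGUL SYLLABLE HAN) is a single precomposed syllable; NFD
# splits it into a leading, a medial and a final conjoining Jamo.
print(decompose("\ud55c"))  # ['U+1112', 'U+1161', 'U+11AB']
```

A width function that looks at these three code points one by one has to
give the medial and final Jamo width 0 for the total to come out as 2.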
>
> What I am more concerned about: When you look at the code charts
> https://www.unicode.org/charts/PDF/U1100.pdf
> https://www.unicode.org/charts/PDF/UD7B0.pdf
> you see that there are glyphs.
> - In which circumstances are these characters used individually?
> Maybe in a text book for Korean children?
> - How are they supposed to be rendered in these situations? Surely
> as glyphs of width 2, no?
To render as separate components, there are several options:
- Use the non-conjoining forms from the Hangul Compatibility Jamo
block (U+3130–U+318F). It covers the Jamo in modern use, from the
standard KS X 1001:1998. It doesn't cover archaic Jamo.
- Use the filler choseong (initial) U+115F and jungseong (medial)
U+1160 Jamo as appropriate, to create a syllable with only the
required Jamo displayed. The font may still squeeze the Jamo in a
corner.
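- Use non-Korean characters to separate the Jamo, e.g. U+200B zero
width space or U+2060 word joiner. Here we have a problem: each Jamo
is then displayed on its own, but a context-free width function still
reports the medial and final Jamo as width 0.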
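As a rough illustration of the three options, the following Python sketch
builds one string per option for the medial Jamo U+1161 and lists the
character names. The dictionary keys and the describe() helper are made up
for this example; only the standard unicodedata module is used.

```python
import unicodedata

MEDIAL_A = "\u1161"  # HANGUL JUNGSEONG A, a conjoining medial Jamo

options = {
    # Option 1: non-conjoining form from Hangul Compatibility Jamo.
    "compatibility": "\u314f",
    # Option 2: choseong filler followed by the medial Jamo.
    "filler": "\u115f" + MEDIAL_A,
    # Option 3: a leading Jamo, then a zero width space, then the medial.
    "separator": "\u1100" + "\u200b" + MEDIAL_A,
}

def describe(s):
    """Return the Unicode character names of the code points in s."""
    return [unicodedata.name(c) for c in s]

for label, text in options.items():
    print(label, describe(text))
```

How each of these is actually drawn still depends on the font and on the
terminal emulator.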
>
> In the end, it comes down to: What is the more frequent context for
> these characters?
>
Ideally, everyone would pass complete strings, or at least complete
(extended?) grapheme clusters, to functions like wcswidth() or
u32_width(), and these functions would take context into account, as
Perl's Unicode::GCString does. Since wcwidth/g_unichar_*/uc_width are
widely used, results are sometimes going to be wrong. I don't really
know whether NFD filenames causing trouble, with decomposed Hangul
taking 3 or 4 columns, are more common than attempts to use separate
Jamo in terminal emulators, though I suspect so.
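A minimal Python sketch of what "taking context into account" could look
like for Hangul Jamo. This is not how libunistring, glibc or
Unicode::GCString are implemented; the function names jamo_class() and
context_width() are made up, the Jamo ranges are hard-coded from the Hangul
syllable types, and the fallback for non-Hangul characters is deliberately
crude. A medial or final Jamo gets width 0 only when it directly follows a
Hangul character it can conjoin with, and width 2 otherwise.

```python
import unicodedata

def jamo_class(ch):
    """Classify ch as L, V, T, LV, LVT, or None (not Hangul)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"    # leading consonants
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"    # medial vowels
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"    # trailing consonants
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None

def context_width(s):
    """Sum column widths, zeroing Jamo that conjoin with their predecessor."""
    total = 0
    prev = None
    for ch in s:
        cls = jamo_class(ch)
        if cls == "V" and prev in ("L", "V", "LV"):
            w = 0     # medial joins the preceding syllable block
        elif cls == "T" and prev in ("V", "T", "LV", "LVT"):
            w = 0     # final joins the preceding syllable block
        elif cls is not None:
            w = 2     # Hangul that starts a new block, or stands alone
        elif unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            w = 0     # combining marks and format characters such as U+200B
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            w = 2
        else:
            w = 1
        total += w
        prev = cls
    return total

joined = "\u1100\ud7b0\ud7fb\u1101\ud7c0\ud7c2\ud7d0"
separated = "\u1100\u200b\ud7b0\u200b\ud7fb\u1101\u200b\ud7c0\u200b\ud7c2\u200b\ud7d0"
print(context_width(joined), context_width(separated))
```

On the two test strings from the Perl script this gives 4 and 14, the same
figures GCString->columns reports, whereas a per-character sum with medial
and final Jamo fixed at width 0 gives 4 for both.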