bug-libunistring

Re: [bug-libunistring] GNU libunistring's Korean canonical composition b


From: Bruno Haible
Subject: Re: [bug-libunistring] GNU libunistring's Korean canonical composition bug report.
Date: Sat, 18 Nov 2017 23:39:20 +0100
User-agent: KMail/5.1.3 (Linux/4.4.0-98-generic; KDE/5.18.0; x86_64; ; )

Hello DaeHyun Sung,

> I found Korean Syllables Canonical Decomposition bug Not fully decompose
> Hangul Syllables.
> Expected: U+D4DB → <U+1111, U+1171, U+11B6> = Full canonical composition
> result. correct!
> Result: U+D4DB → <U+D4CC,U+11B6> = only intermediate step. incorrect

The main entry points in libunistring for canonical decomposition are the
u*_normalize functions with the UNINORM_NFD argument (all declared in
<uninorm.h>).

These functions do provide the "full canonical decomposition". This is
verified by this test (in tests/test-u32-nfd.c):

  { /* HANGUL SYLLABLE GEUL */
    static const uint32_t input[]    = { 0xAE00 };
    static const uint32_t expected[] = { 0x1100, 0x1173, 0x11AF };
    ASSERT (check (input, SIZEOF (input), expected, SIZEOF (expected)) == 0);
  }

So, you must be talking about the uc_canonical_decomposition function.
The documentation indeed does not make clear whether it returns the
"canonical decomposition" or the "full canonical decomposition".

I am using these terms, defined by Unicode:
  * Unicode 10.0 PDF, page 146 (PDF page 182):
    "Full Canonical Decomposition. The full canonical decomposition for a
     Unicode character is defined as the recursive application of canonical
     decomposition mappings. The canonical decomposition mapping of an
     LVT_Syllable contains an LVPart which itself is a precomposed Hangul
     syllable and thus must be further decomposed."
  * http://unicode.org/reports/tr15/#Description_Norm
    "To transform a Unicode string into a given Unicode Normalization Form,
     the first step is to fully decompose the string. ... Full decomposition
     involves recursive application of the Decomposition_Mapping values,
     because in some cases a complex composite character may have a
     Decomposition_Mapping into a sequence of characters, one of which may
     also have its own non-trivial Decomposition_Mapping value."
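
To make the difference between the two terms concrete for Hangul, here is a
minimal sketch of the syllable arithmetic from the Unicode Standard, written
as standalone C. It is not the libunistring implementation, and the function
names hangul_decompose_step and hangul_decompose_full are made up for this
illustration. The first function applies the canonical decomposition mapping
once; the second applies it recursively.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants of the Hangul syllable arithmetic (Unicode Standard).  */
#define S_BASE  0xAC00u
#define L_BASE  0x1100u
#define V_BASE  0x1161u
#define T_BASE  0x11A7u
#define T_COUNT 28u
#define N_COUNT 588u    /* 21 vowels * 28 trailing positions */
#define S_COUNT 11172u  /* 19 leading consonants * N_COUNT */

/* Illustrative helper, not a libunistring function.
   Applies the canonical decomposition mapping once.
   Returns 2 and fills OUT, or returns 0 if UC is not a precomposed
   Hangul syllable.  */
static size_t
hangul_decompose_step (uint32_t uc, uint32_t out[2])
{
  if (uc < S_BASE || uc >= S_BASE + S_COUNT)
    return 0;

  uint32_t s_index = uc - S_BASE;
  uint32_t t_index = s_index % T_COUNT;
  if (t_index != 0)
    {
      /* LVT syllable: maps to <LV syllable, trailing consonant>.  */
      out[0] = uc - t_index;
      out[1] = T_BASE + t_index;
    }
  else
    {
      /* LV syllable: maps to <leading consonant, vowel>.  */
      out[0] = L_BASE + s_index / N_COUNT;
      out[1] = V_BASE + (s_index % N_COUNT) / T_COUNT;
    }
  return 2;
}

/* Illustrative helper, not a libunistring function.
   Full canonical decomposition: applies the mapping again to the LV part.
   Returns 2 or 3 and fills OUT, or returns 0 if UC is not a precomposed
   Hangul syllable.  */
static size_t
hangul_decompose_full (uint32_t uc, uint32_t out[3])
{
  uint32_t pair[2];
  uint32_t lv[2];

  if (hangul_decompose_step (uc, pair) == 0)
    return 0;
  if (hangul_decompose_step (pair[0], lv) == 2)
    {
      /* The first element was itself an LV syllable: decompose it too.  */
      out[0] = lv[0];
      out[1] = lv[1];
      out[2] = pair[1];
      return 3;
    }
  out[0] = pair[0];
  out[1] = pair[1];
  return 2;
}
```

For U+D4DB, the single step yields <U+D4CC, U+11B6>, and the recursive
version yields <U+1111, U+1171, U+11B6>, which are the two results quoted at
the top of this mail.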

We have the choice:
  (a) Change uc_canonical_decomposition to do full canonical decomposition.
  (b) Introduce a new function uc_full_canonical_decomposition, leaving
      uc_canonical_decomposition as is.
  (c) Document how to obtain the full canonical decomposition of a character.

(a) is not good because the code in uninorm/u-normalize-internal.h (which is
    the core of the u*_normalize functions) already does a recursion, and
    even has a comment explaining why this recursion pass is useful, namely
    to keep the table sizes small.
(b) is not good because we already have a function that does full canonical
    decomposition on strings (u32_normalize with argument UNINORM_NFD),
    and it is not useful to introduce new API just for strings of length 1.
(c) is what I have chosen to do. Pushed:
    http://git.savannah.gnu.org/gitweb/?p=libunistring.git;a=commitdiff;h=4e49b798264d01433f64137fb525f507778fb781

> Korean Alphabet Hangul Canonical Decomposition Explain
> Hangul elements are commonly referred to as jamo(자모/字母), meaning “alphabet”
> 
> Korean has special term for the jamo that are used to construct hangul
> syllable, depending on where in the syllable they appear:
> - Choseong(초성/初聲) for the initial sound, usually a consonant
> - Jungseong(중성/中聲) for the middle sound, usually a vowel
> - Jongseong(종성/終聲) for the final sound, usually a consonant
> 
> Hangul syllables are the characters that are used to express contemporary
> Korean texts in writing.
> 
> ex1) Decomposition of hangul syllable
> Unicode codepoint: U+AC00
> Hangul(한글) ‘가’
> jamo(자모/字母): ㄱ plus ㅏ
> choseong(초성/初聲): ㄱ (codepoint: U+1100)
> jungseong(중성/中聲): ㅏ(codepoint: U+1161)
> 
> Selected Hangul syllable ‘가’(U+AC00)
> Present
> Canonical decomposition:
> ㄱ U+1100 HANGUL CHOSEONG KIYEOK
> ㅏ U+1161 HANGUL JUNGSEONG A
> 
> Expected result
> Canonical decomposition:
> ㄱ U+1100 HANGUL CHOSEONG KIYEOK
> ㅏ U+1161 HANGUL JUNGSEONG A
> 
> Hangul Choseong:ᄀ
> Hangul Jungseong:ᅡ
> 
> ex2) Decomposition of hangul syllable
> Unicode code point: U+AC01
> Hangul(한글) ‘각’
> jamo(자모/字母):  ‘ᄀ’  plus ‘ᅡ’  plus ‘ᆨ’
> choseong(초성/初聲):ㄱ (codepoint: U+1100)
> jungseong(중성/中聲):ㅏ(codepoint: U+1161)
> jongseong(종성/終聲):ᆨ (codepoint: U+11A8)
> 
> 
> Selected Hangul syllable ‘각’(U+AC01)
> Present
> Canonical decomposition:
> ‘가 U+AC00 HANGUL SYLLABLE GA'   It's intermediate step.
> 'ᆨ U+11A8 HANGUL JONGSEONG KIYEOK'
> 
> Expected Result
> Canonical decomposition(Fully):
> ㄱ U+1100 HANGUL CHOSEONG KIYEOK
> ㅏ U+1161 HANGUL JUNGSEONG A
> ᆨ U+11A8 HANGUL JONGSEONG KIYEOK
> 
> Hangul Choseong:ᄀ
> Hangul Jungseong:ᅡ
> Hangul Jongseong:ᆨ

You don't need to explain this. We are all familiar with this
(from the Unicode book, from Ken Lunde's book, or from Wikipedia [1]).

Best regards,

            Bruno

[1] https://en.wikipedia.org/wiki/Korean_language_and_computers#Hangul_in_Unicode



