Re: [bug-gnu-libiconv] Fwd: Supporting Combining Diacritical Marks
From: Bruno Haible
Date: Thu, 30 Jun 2022 02:08:22 +0200
Christian PERNOT wrote:
> We are using gnu libiconv in our alpine-based container, and we face some
> difficulties with some special characters.
>
> These characters are using Unicode Combining Diacritical Marks like this
> one : https://www.fileformat.info/info/unicode/char/301/index.htm
>
> I didn't know this behavior in Unicode, but it is a diacritical mark that
> is put after the base character, and they should be combined on display.
>
> For example : "é" exists in UTF8 as one character (0xC3 0xA9) :
> https://www.fileformat.info/info/unicode/char/e9/index.htm
> But it may be displayed the same way by having an "e" without accent (0x65),
> followed by the accent character (0xCC 0x81)
Yes, the first form is called NFC, the second one is called NFD. [1]
Find attached your sample in NFC and NFD, respectively.
Generally, text is exchanged between systems and between applications
in the NFC form [2]. This means that the NFD form is mostly used only for
internal processing (e.g. searching, sorting).
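To make the two forms concrete, here is a minimal sketch using Python's standard unicodedata module (Python is my choice for illustration, it is not part of the original report). It shows that normalizing either form yields the byte sequences quoted above; the variable names are only illustrative.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point, U+00E9
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, U+0301

# NFC composes base letter + combining mark; NFD decomposes it again.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc.encode("utf-8").hex(" "))   # c3 a9
print(nfd.encode("utf-8").hex(" "))   # 65 cc 81
print(nfc == composed, nfd == decomposed)
```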
> There is no difference on display, but iconv won't accept to convert to
> ascii with or without transliteration
>
> here is my attempt :
>
> ~/local/bin/iconv -f UTF-8 -t ASCII//TRANSLIT ~/src/iconv.txt
>
> Capture d'e
> /home/cpernot/local/bin/iconv: /home/cpernot/src/iconv.txt:1:11: ne peut
> convertir
Different iconv implementations produce different results. Let's look at
glibc first:
$ iconv -f UTF-8 -t ASCII//TRANSLIT < iconv_NFC.txt
Capture ecran 2020-03-24 a 10.51.25.png
$ iconv -f UTF-8 -t ASCII//TRANSLIT < iconv_NFD.txt
Capture ecran 2020-03-24 a 10.51.25.png
The output is the same.
Whereas libiconv produces:
$ iconv -f UTF-8 -t ASCII//TRANSLIT < iconv_NFC.txt
Capture 'ecran 2020-03-24 `a 10.51.25.png
$ iconv -f UTF-8 -t ASCII//TRANSLIT < iconv_NFD.txt
Capture e
/.../bin/iconv: (stdin):1:9: cannot convert
As you can see, GNU libiconv attempts to represent the accent rather than
lose it. But in NFD form the accent comes after the letter, so handling it
would require more complex processing. Since the advice is to pass only NFC
text to programs, implementing this is not really worth it.
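If the input may arrive in NFD form, one possible workaround is to compose it to NFC before handing it to iconv. The following is only a sketch under that assumption, again using Python's unicodedata module; the helper name to_nfc_utf8 and the sample bytes are hypothetical and not taken from the attached files.

```python
import unicodedata

def to_nfc_utf8(data: bytes) -> bytes:
    """Decode UTF-8 bytes, compose the text to NFC, and re-encode as UTF-8."""
    text = data.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")

# Hypothetical sample: "écran" written as e + U+0301 (NFD form).
nfd_sample = b"e\xcc\x81cran"
nfc_sample = to_nfc_utf8(nfd_sample)

print(nfc_sample)   # b'\xc3\xa9cran'
```

The resulting bytes are in the form that the NFC examples above were run with, so they could then be passed to iconv in the usual way.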
Bruno
[1] https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
[2] https://www.w3.org/International/questions/qa-html-css-normalization.en.html
Attachments:
iconv_NFC.txt (text document)
iconv_NFD.txt (text document)