From: Bruno Haible
Subject: [bug-gnu-libiconv] Re: Question regarding libiconv 1.13 and Hebrew (cp1255 -> utf8) translation
Date: Sun, 10 Apr 2011 13:55:53 +0200
User-agent: KMail/1.9.9
[CCing bug-gnu-libiconv]
Hello Ron,
> Attached is a ZIP with three files which illustrate the problem.
>
> The source "heb1.utf" is converted to heb1.cp1255:
>
> iconv -f utf-8 -t cp1255 heb1.utf > heb1.cp1255
>
> and converted back to UTF8:
>
> iconv -f cp1255 -t utf-8 heb1.cp1255 > heb2.utf
>
>
> Note that the character sequence: 05d9 05bc 05b9 is re-converted to
> fb39 05b9
Yes, the sequence of characters
U+05D9 HEBREW LETTER YOD
U+05BC HEBREW POINT DAGESH
is canonically equivalent to
U+FB39 HEBREW LETTER YOD WITH DAGESH
For an explanation of "canonical equivalence" and the normalization forms NFC
and NFD, see Unicode UAX #15 <http://www.unicode.org/reports/tr15/>.
In particular:
"The W3C Character Model for the World Wide Web, Part II: Normalization
[CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition)
recommend using Normalization Form C for all content, because this form
avoids potential interoperability problems arising from the use of
canonically equivalent, yet different, character sequences in document
formats on the Web. See the W3C Requirements for String Identity,
Matching, and String Indexing [CharReq] for more background."
[CharNorm] = http://www.w3.org/TR/charmod-norm/
Normalization form NFC is recommended everywhere, and 'iconv' (both from
GNU libiconv and from GNU libc) produces this normalization form.
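The canonical equivalence mentioned above can be checked mechanically. The following is a minimal sketch in Python, using only the standard `unicodedata` module; it is not part of iconv, and the helper name `canonically_equivalent` is made up for illustration. It relies on the rule from UAX #15 that two strings are canonically equivalent when their NFD forms are identical.

```python
import unicodedata

decomposed = "\u05d9\u05bc"   # HEBREW LETTER YOD + HEBREW POINT DAGESH
precomposed = "\ufb39"        # HEBREW LETTER YOD WITH DAGESH

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The two-character sequence and the single precomposed character.
print(canonically_equivalent(decomposed, precomposed))

# The full sequences from the report: 05d9 05bc 05b9 versus fb39 05b9.
print(canonically_equivalent("\u05d9\u05bc\u05b9", "\ufb39\u05b9"))
```

Comparing with `==` directly would treat the two spellings as different strings, which is what a naive search does.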
> I am using "vim" to edit Hebrew texts, and have been bothered for a
> while with a specific problem.
>
> The problem is that some sequences map to Unicode composited characters,
> which makes editing (specifically searching!) more difficult than it
> should be.
Searching for substrings, and meeting user expectations while doing that,
has indeed become more complex than it was before Unicode; see Unicode TR #10
<http://www.unicode.org/reports/tr10/>. It is the duty of the programs
("vim" in this case) to meet these user expectations.
Perhaps GNU libunistring <http://www.gnu.org/software/libunistring/> can help
the vim implementors in doing this.
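To illustrate what "meeting user expectations" could mean for a substring search, here is a rough sketch in Python. It is not how vim or libunistring work, and the function name `contains_normalized` is hypothetical; it simply brings both the text and the query to NFD before comparing, which is far less than the full matching rules of TR #10.

```python
import unicodedata

def contains_normalized(haystack: str, needle: str) -> bool:
    """Hypothetical helper: substring test after normalizing both sides to NFD."""
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    nfd_needle = unicodedata.normalize("NFD", needle)
    return nfd_needle in nfd_haystack

# Text containing the precomposed U+FB39, followed by VAV and FINAL MEM.
text = "\ufb39\u05d5\u05dd"
# The user types YOD followed by DAGESH.
query = "\u05d9\u05bc"

print(query in text)                     # plain code point comparison
print(contains_normalized(text, query))  # comparison after NFD
```

One limitation of this approach: NFD also reorders adjacent combining marks into a canonical order, so a query that ends in the middle of a run of marks may no longer appear as a contiguous substring of the normalized text.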
> While it may be correct, it really makes editing very difficult. Is
> there a way to change this behavior of iconv?
You can, of course, convert your files from NFC to NFD before editing, and
convert them back from NFD to NFC after editing. A ready-made program for
doing so is 'uconv', part of ICU: "uconv -f utf8 -t utf8 -x nfd" converts
to NFD, and "uconv -f utf8 -t utf8 -x nfc" converts back to NFC.
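If 'uconv' is not at hand, the same kind of conversion can be approximated with a few lines of Python. This is only a sketch using the standard `unicodedata` module, not a replacement for ICU; the function name `normalize_file` and the file names in the demonstration are assumptions made for the example.

```python
import os
import tempfile
import unicodedata

def normalize_file(src: str, dst: str, form: str = "NFD") -> None:
    """Hypothetical helper: rewrite a UTF-8 file in the given normalization form."""
    # newline="" leaves the file's line endings untouched.
    with open(src, encoding="utf-8", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(unicodedata.normalize(form, text))

# Small demonstration on a temporary file (paths are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "heb2.utf")
    dst = os.path.join(tmp, "heb2.nfd.utf")
    with open(src, "w", encoding="utf-8", newline="") as f:
        f.write("\ufb39\u05b9\n")          # fb39 05b9, as in the report
    normalize_file(src, dst, "NFD")
    with open(dst, encoding="utf-8", newline="") as f:
        result = f.read()

# Show the code points that ended up in the converted file.
print([f"U+{ord(c):04X}" for c in result])
```

Note that NFD also puts combining marks into canonical order, so the points following a letter may come out in a different order than they were originally typed.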
Bruno
--
In memoriam Hendrik Nicolaas Werkman
<http://en.wikipedia.org/wiki/Hendrik_Nicolaas_Werkman>