Re: [bug-gnu-libiconv] problem with iso-8859-8 encoding
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] problem with iso-8859-8 encoding
Date: Tue, 26 Feb 2008 03:25:28 +0100
User-agent: KMail/1.5.4
Hello,
Alexander Sirotkin wrote:
> I find it hard to believe, but apparently iconv has a problem converting
> iso-8859-8 (hebrew) to any other encoding, for instance UTF-8. Hebrew
> letters in the result appear in the reverse order.
As you can read in [1], [2], text in ISO 8859-8 is "sometimes in logical,
sometimes in visual order". Therefore the request to convert ISO-8859-8 to
UTF-8 is already ambiguous per se. Some others [3] say that ISO-8859-8 is always
visual... Oh well.
Additionally, conversion between visual and logical order requires an
arbitrary amount of memory (whose size depends on the input); this
does not fit into the way iconv is implemented in GNU libc and in GNU libiconv.
For these reasons, GNU libc and GNU libiconv don't implement this reordering.
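To illustrate both points, here is a minimal Python sketch. It is not how
iconv or libiconv is implemented. It assumes the input is ISO-8859-8 in visual
order and contains only Hebrew letters, spaces and newlines; the helper name
naive_visual_to_logical is made up for this example. A plain charset conversion
maps each byte to one code point and keeps the order, whereas even this naive
reordering cannot emit anything until it has seen the end of the line.

```python
# Hebrew word "shalom" in logical order: shin, lamed, vav, final mem.
LOGICAL = "\u05e9\u05dc\u05d5\u05dd"

# The same word as visual-order ISO-8859-8 bytes: stored left to right
# as displayed, i.e. reversed with respect to the logical order.
visual_bytes = LOGICAL[::-1].encode("iso8859_8")

# A charset conversion is a byte-by-byte mapping; the order is untouched.
converted = visual_bytes.decode("iso8859_8")


def naive_visual_to_logical(chunks):
    """Reverse each line of visual-order text (hypothetical helper).

    Assumption: lines contain only Hebrew letters and spaces. Digits,
    Latin text and mirrored punctuation need a real bidi implementation.
    """
    pending = ""  # grows with the line length, so it is unbounded in general
    for chunk in chunks:
        pending += chunk
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            yield line[::-1] + "\n"
    if pending:
        yield pending[::-1]


# Feed the converted text in two chunks, as a streaming converter would see it.
reordered = "".join(naive_visual_to_logical([converted[:2], converted[2:]]))
```

Here converted is still in visual order, and reordered matches LOGICAL only
because the input satisfies the assumptions above.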
Fribidi implements reordering from logical to visual order.
The only free software (that I know of) that does reordering of ISO-8859-8
from visual to logical is ICU, and its documentation [4] says:
"Legacy systems frequently stored text in visual order to avoid
reordering for display. When exchanging data with such systems for
processing in Unicode it is necessary to reorder the data from visual
order to logical order and back. Such not-for-display transformations
are sometimes referred to as "storage layout" transformations.
There are two problems with an "inverse reordering" from visual to
logical order: There may be more than one logical order of text that
results in the same display (logical-to-visual reordering is a many-to-one
function), and there is no standard algorithm for it. ICU's BiDi API
provides a setting for "inverse" operation that modifies the standard
Unicode Bidi algorithm. However, it may not always produce the expected
results. Bidirectional data should be converted to Unicode and reordered
to logical order only once to avoid roundtrip losses. Just as it is best
to never convert to non-Unicode charsets, data should not be reordered
from logical to visual order except for display and printing."
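The "many-to-one" remark can be illustrated with a toy model in Python. This
is only a sketch under strong assumptions: it is neither ICU's BiDi API nor the
full Unicode Bidi algorithm, it handles only strong left-to-right and
right-to-left letters (no spaces, digits or punctuation), and the function name
toy_visual is made up for this example. Two different logical inputs, one in a
left-to-right paragraph and one in a right-to-left paragraph, produce the same
displayed line, so the display alone does not determine the logical order.

```python
import unicodedata
from itertools import groupby


def toy_visual(logical, base_rtl):
    """Toy logical-to-visual reordering (hypothetical helper).

    Assumption: the text contains only strong letters, i.e. characters
    whose bidi class is "L" or "R". Returns the characters in the order
    they appear on screen, from left to right.
    """
    runs = []
    is_rtl_char = lambda c: unicodedata.bidirectional(c) == "R"
    for is_rtl, group in groupby(logical, key=is_rtl_char):
        run = "".join(group)
        # A right-to-left run is displayed with its characters reversed.
        runs.append(run[::-1] if is_rtl else run)
    if base_rtl:
        # In a right-to-left paragraph the runs are laid out right to left.
        runs.reverse()
    return "".join(runs)


HEBREW = "\u05d0\u05d1\u05d2"  # alef, bet, gimel

# Latin then Hebrew, in a left-to-right paragraph.
shown_ltr = toy_visual("abc" + HEBREW, base_rtl=False)
# Hebrew then Latin, in a right-to-left paragraph.
shown_rtl = toy_visual(HEBREW + "abc", base_rtl=True)
```

Both calls yield "abc" followed by gimel, bet, alef, even though the logical
strings differ.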
Bruno
[1] http://en.wikipedia.org/wiki/ISO_8859-8
[2] http://en.wikipedia.org/wiki/ISO-8859-8-I
[3] http://www.w3.org/TR/2002/WD-xhtml2-20021211/mod-bidi.html
[4] http://www.icu-project.org/userguide/icu.pdf