Re: [bug-gnu-libiconv] iconv incorrectly converts escape characters 0x1b
From: Seikoh NISHITA
Subject: Re: [bug-gnu-libiconv] iconv incorrectly converts escape characters 0x1b from UTF-8 to ISO-2022-JP
Date: Tue, 24 Mar 2015 18:01:14 +0900
Thank you for the quick reply.
2015-03-24 12:22 GMT+09:00 Bruno Haible <address@hidden>:
> Hello,
>
>> ISO-2022-JP is one of the popular character encoding schemes for email
>> texts in Japan.
>
> I don't think that it is still popular, for 20 or 30 years already, as it
> cannot encode half-width Katakana characters (it can only encode Katakana as
> full-width characters, which is extremely unusual).
Really? That surprises me very much.
Could you point me to any information about the popularity of ISO-2022-JP
and ISO-2022-JP-{2,3}?
> Try ISO-2022-JP-2 or ISO-2022-JP-3 instead. That's why these encodings
> were created.
>
> See https://en.wikipedia.org/wiki/ISO/IEC_2022#ISO.2FIEC_2022_character_sets
>
>> I report incorrect conversion by iconv w.r.t. ISO-2022-JP.
I know both ISO-2022-JP-2 and ISO-2022-JP-3.
>> The byte value 0x1b in UTF-8 text is converted to the same byte value
>> in ISO-2022-JP by iconv.
>
> Since the byte value 0x1b is used as escape character in the ISO-2022-*
> family of encodings, and these encodings provide no way to encode a ESC
> character as such, "byte value 0x1b in UTF-8 text" is invalid input for
> such a conversion. In other words, use ASCII without ESC characters,
> or UTF-8 without ESC characters, as input.
>
Yes, I agree with you.
However, I think software developers sometimes omit the input validation
that checks for invalid ESC characters before conversion.
Because ISO-2022-* text must not contain invalid ESC characters, as you wrote,
and libiconv is one of the basic libraries for developers,
I think libiconv should terminate the conversion to ISO-2022-* when it
finds an invalid ESC character.
What do you think about it?
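For illustration, the check that callers have to do themselves today could look like the following minimal sketch. It is written in Python rather than against the libiconv C API, it assumes Python's built-in `iso2022_jp` codec, and `encode_iso2022jp_strict` is a hypothetical helper name, not something libiconv or Python provides.

```python
ESC = "\x1b"  # the escape character that ISO-2022-* reserves for its own use


def encode_iso2022jp_strict(text: str) -> bytes:
    """Encode text as ISO-2022-JP, refusing input that contains ESC.

    Hypothetical helper: it only illustrates the pre-conversion check,
    it is not part of any library.
    """
    pos = text.find(ESC)
    if pos != -1:
        # Stop before conversion instead of passing the ESC through.
        raise ValueError(f"ESC (0x1b) found at character offset {pos}")
    return text.encode("iso2022_jp")
```

A caller would wrap the call in `try`/`except ValueError` and treat the input as invalid, which is the behavior proposed above for the converter itself.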
> Bruno
>
Sincerely yours,
Seikoh.
P.S.
I sent a new mail containing the text "KON-NICHIHA" in full-width hiragana
characters from several webmail sites
and checked the default character encoding scheme of each.
o Gmail: UTF-8 (base64 encoded)
o Yahoo Mail: ISO-2022-JP
o Outlook in Office365: ISO-2022-JP (switched from "HTML text" to
"plain text")
Next, I sent a mail with the same text in half-width katakana characters.
o Gmail: UTF-8 (base64 encoded)
o Yahoo Mail: ISO-2022-JP (converted to full-width katakana characters)
o Outlook in Office365: ISO-2022-JP (converted to full-width
katakana characters)
These three mail sites use UTF-8 or ISO-2022-JP (not ISO-2022-JP-2 or ISO-2022-JP-3).
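
The half-width to full-width conversion observed above can be reproduced with Unicode compatibility normalization. This is only a sketch of one way to get the same result; how Yahoo Mail and Outlook actually perform the conversion is not known here, and `to_fullwidth` is a hypothetical helper name.

```python
import unicodedata


def to_fullwidth(text: str) -> str:
    """Hypothetical helper: fold half-width katakana to full-width forms."""
    # NFKC replaces half-width katakana (U+FF66..U+FF9F) with their
    # full-width equivalents. It also folds other compatibility characters.
    return unicodedata.normalize("NFKC", text)


halfwidth = "\uff7a\uff9d\uff86\uff81\uff8a"  # KON-NICHIHA in half-width katakana
fullwidth = to_fullwidth(halfwidth)

# The full-width katakana are in JIS X 0208, so plain ISO-2022-JP can carry them.
encoded = fullwidth.encode("iso2022_jp")
```

Note that NFKC is broader than a katakana-only mapping (for example, it also turns full-width Latin letters into ASCII), so a mailer that wants to touch only katakana would need a narrower table.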
--
------------------------------------------------------
Seikoh Nishita
Department of Computer Science,
Faculty of Engineering, Takushoku University
815-1, Tate-machi
Hachioji city, Tokyo
193-0985, Japan
Tel: +81-42-665-8529, +81-42-665-1441 (ex. 5308)
Fax: +81-42-665-1519
E-Mail: address@hidden
西田 誠幸 (にした せいこう)
〒193-0985 東京都八王子市館町815-1
拓殖大学工学部情報工学科
Tel: 042-665-8529, 042-665-1441 (ex. 5308)
Fax: 042-665-1519
E-Mail: address@hidden