From: Seikoh NISHITA
Subject: [bug-gnu-libiconv] iconv incorrectly converts escape characters 0x1b from UTF-8 to ISO-2022-JP
Date: Tue, 24 Mar 2015 11:45:05 +0900
ISO-2022-JP is one of the popular character encoding schemes for email
texts in Japan.
I would like to report an incorrect conversion by iconv involving ISO-2022-JP.
The byte value 0x1b in UTF-8 text is converted to the same byte value
in ISO-2022-JP by iconv.
This conversion does not follow the specification of ISO-2022-JP.
As a result, the round-trip conversion between UTF-8 and ISO-2022-JP
is impaired.
Although escape sequences in UTF-8 text look strange, such text might
be generated by software
that unexpectedly accepts escape sequences as user input and
concatenates them with an embedded character sequence.
The following is what I tried with iconv version 1.11 in a terminal
emulator on Mac OS X.
$ echo -en "\x1b" > a.txt
$ od -tx1 a.txt
0000000 1b
0000001
$ iconv -f UTF-8 -t ISO-2022-JP a.txt >b.txt
$ od -tx1 b.txt
0000000 1b
0000001
$
(the byte value 0x1b in UTF-8 text is converted to the same byte
value in ISO-2022-JP by iconv.)
$ echo -en "\x1b\x24\x42\x46\x7c" > x.txt
$ cat x.txt
BF|
$ iconv -f UTF-8 -t ISO-2022-JP x.txt > y.txt
$ iconv -f ISO-2022-JP -t UTF-8 y.txt > z.txt
$ cat z.txt
日
(the round-trip conversion between UTF-8 and ISO-2022-JP fails in this case.)
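The round-trip property tested above can also be expressed as a small check. The following is only a sketch: it uses Python's built-in iso2022_jp codec rather than iconv, so its handling of a raw ESC may differ from libiconv's, and the helper name round_trips is hypothetical.

```python
def round_trips(text: str, encoding: str = "iso2022_jp") -> bool:
    """Return True if text survives encode -> decode unchanged.

    Hypothetical helper; it relies on Python's codec, not on iconv.
    """
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # A codec that rejects the text outright does not round-trip it either.
        return False


print(round_trips("日"))         # ordinary kanji text
print(round_trips("\x1b$BF|"))   # the text from x.txt, starting with a raw ESC
```

The second call should report a failure whether the codec passes the ESC through unchanged (the bytes then decode to a different string) or refuses to encode it.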
The last character is the Japanese kanji character Nichi, which can be
found on the following Web page:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=65E5
In fact, x.txt and y.txt contain the same byte sequence, 0x1b 24 42 46 7c.
But the sequence is interpreted differently in UTF-8 and ISO-2022-JP.
UTF-8 interpretation:
ESC(1b) $(24) B(42) F(46) |(7c)
ISO-2022-JP interpretation:
escape sequence (1b 24 42) Japanese character Nichi (46 7c)
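The two interpretations can be reproduced by decoding the same five bytes with two different codecs. This is a minimal sketch that assumes Python's built-in utf-8 and iso2022_jp codecs; it illustrates the ambiguity and is not a description of how iconv works internally.

```python
# The five bytes shared by x.txt and y.txt in the report.
data = bytes([0x1B, 0x24, 0x42, 0x46, 0x7C])

# Read as UTF-8, every byte is a separate ASCII character: ESC $ B F |
as_utf8 = data.decode("utf-8")

# Read as ISO-2022-JP, ESC $ B switches to JIS X 0208 and 46 7c is one kanji.
as_iso2022jp = data.decode("iso2022_jp")

print([hex(ord(c)) for c in as_utf8])       # five code points
print([hex(ord(c)) for c in as_iso2022jp])  # one code point, U+65E5
```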
According to RFC 1468, which defines ISO-2022-JP, escape characters are
used only as the first character of escape sequences that switch the
character set.
[Quotation of Section "Formal Syntax" in RFC 1468]
single-byte-seq = ESC "(" ( "B" / "J" )
double-byte-seq = ESC "$" ( "@" / "B" )
single-byte-char = <any 7BIT, including bare CR & bare LF, but NOT
including CRLF, and not including ESC, SI, SO>
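Based on the quoted syntax, a caller could check for the excluded control characters before converting. The sketch below is one possible guard, not a fix for iconv itself: the names FORBIDDEN_CONTROLS, find_forbidden_controls and encode_iso2022jp_checked are hypothetical, and rejecting the input is only one policy (dropping or replacing the characters would be another).

```python
# Controls that the quoted single-byte-char rule excludes: ESC, SO, SI.
FORBIDDEN_CONTROLS = {"\x1b": "ESC", "\x0e": "SO", "\x0f": "SI"}


def find_forbidden_controls(text: str) -> list:
    """Return (index, name) pairs for each ESC, SO or SI found in text."""
    return [
        (i, FORBIDDEN_CONTROLS[ch])
        for i, ch in enumerate(text)
        if ch in FORBIDDEN_CONTROLS
    ]


def encode_iso2022jp_checked(text: str) -> bytes:
    """Encode to ISO-2022-JP, refusing text that carries ESC, SO or SI."""
    found = find_forbidden_controls(text)
    if found:
        raise ValueError(f"not representable as single-byte-char: {found}")
    # Python's codec is used here as a stand-in for iconv (an assumption).
    return text.encode("iso2022_jp")


print(find_forbidden_controls("\x1b$BF|"))  # the ESC at index 0 is reported
print(encode_iso2022jp_checked("日"))        # escape sequences added by the codec
```

With such a check, the text from x.txt is rejected before conversion instead of silently turning into a different character on the way back.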
--
------------------------------------------------------
Seikoh Nishita
Department of Computer Science,
Faculty of Engineering, Takushoku University
815-1, Tate-machi
Hachioji city, Tokyo
193-0985, Japan
Tel: +81-42-665-8529, +81-42-665-1441 (ex. 5308)
Fax: +81-42-665-1519
E-Mail: address@hidden
西田 誠幸 (にした せいこう)
〒193-0985 東京都八王子市館町815-1
拓殖大学工学部情報工学科
Tel: 042-665-8529, 042-665-1441 (ex. 5308)
Fax: 042-665-1519
E-Mail: address@hidden