Re: [bug-gnu-libiconv] Trouble converting to Japanese charsets
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Trouble converting to Japanese charsets
Date: Fri, 6 Nov 2009 10:25:12 +0100
User-agent: KMail/1.9.9
Hi,
Jeff Diehl wrote:
> I am having trouble converting the string "配信リスト名テスト①" from UTF-8 to
> SHIFT-JIS, EUC-JP and ISO-2022-JP using libiconv (version 1.13.1).
> Here is a hex representation of the source string:
>
> $ xxd utf8.txt
> 0000000: e985 8de4 bfa1 e383 aae3 82b9 e383 88e5 ................
> 0000010: 908d e383 86e3 82b9 e383 88e2 91a0 ..............
Or, to reproduce it:
$ printf '\xe9\x85\x8d\xe4\xbf\xa1\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88\xe5\x90\x8d\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe2\x91\xa0' > utf8.txt
> The problem seems to be the "circled digit one" character (Unicode
> 0x2460). Can you please explain why these conversions fail?
This fails because the character U+2460 is not in any of these target encodings.
For a reference to the various Japanese encodings, please refer to
http://www.haible.de/bruno/charsets/conversion-tables/Japanese.html
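To double-check that the last three bytes of utf8.txt (e2 91 a0) really are U+2460, the UTF-8 bit layout for a 3-byte sequence can be undone by hand. This is only a sketch: the helper name utf8_3byte_to_codepoint is made up for illustration, it is not part of iconv, and it does no validation of its input.

```shell
# Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into a
# Unicode code point. Hypothetical helper; assumes the three arguments
# really form a well-formed 3-byte sequence.
utf8_3byte_to_codepoint() {
  b1=$(( $1 )); b2=$(( $2 )); b3=$(( $3 ))
  # 4 payload bits from the lead byte, 6 from each continuation byte.
  cp=$(( ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F) ))
  printf 'U+%04X\n' "$cp"
}

utf8_3byte_to_codepoint 0xE2 0x91 0xA0   # prints U+2460
```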
> I was expecting to see libiconv generate the following strings:
>
> $ xxd sjis.txt
> 0000000: 947a 904d 838a 8358 8367 96bc 8365 8358 .z.M...X.g...e.X
> 0000010: 8367 8740 .g.@
0x8740 is not in the Shift_JIS range: In Shift_JIS there are no
characters between 0x84BE and 0x889F.
You probably mean CP932, the Shift_JIS-like encoding used by Windows.
libiconv supports it:
$ iconv -f UTF-8 -t CP932 < utf8.txt | hd
000000 94 7A 90 4D 83 8A 83 58 83 67 96 BC 83 65 83 58 .z.M...X.g...e.X
000010 83 67 87 40 .g.@
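The Shift_JIS gap mentioned above can be made concrete by converting double-byte Shift_JIS codes to the row-cell numbers of the underlying 94x94 table. The sketch below uses the standard Shift_JIS arithmetic; the function name sjis_to_kuten is hypothetical, and it assumes a valid double-byte code as input. It suggests that 0x84BE is row 8, 0x889F is row 16, and 0x8740 falls in row 13, which is inside the unassigned range in plain Shift_JIS.

```shell
# Map a double-byte Shift_JIS code to "row-cell" in the 94x94 grid.
# Hypothetical helper; no validation of lead or trail byte ranges.
sjis_to_kuten() {
  s1=$(( $1 >> 8 )); s2=$(( $1 & 0xFF ))
  # Lead bytes 0xE0 and above continue where 0x9F left off.
  if [ "$s1" -ge 224 ]; then s1=$(( s1 - 0x40 )); fi
  row=$(( (s1 - 0x81) * 2 + 1 ))
  if [ "$s2" -ge 159 ]; then        # trail 0x9F..0xFC: second row of the pair
    row=$(( row + 1 )); cell=$(( s2 - 0x9E ))
  elif [ "$s2" -ge 128 ]; then      # trail 0x80..0x9E (0x7F is skipped)
    cell=$(( s2 - 0x40 ))
  else                              # trail 0x40..0x7E
    cell=$(( s2 - 0x3F ))
  fi
  printf '%d-%d\n' "$row" "$cell"
}

sjis_to_kuten 0x84BE   # prints 8-32
sjis_to_kuten 0x8740   # prints 13-1
sjis_to_kuten 0x889F   # prints 16-1
```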
> $ xxd euc-jp.txt
> 0000000: c7db bfae a5ea a5b9 a5c8 ccbe a5c6 a5b9 ................
> 0000010: a5c8 ada1 ....
0xada1 is not in the EUC-JP range: In EUC-JP there are no characters
between 0xA8C0 and 0xB0A1.
I don't know which of the many EUC-JP variants you were expecting.
I would recommend sticking with plain standardized EUC-JP if you
value interoperability and want to avoid loss of data.
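For the two-byte part of EUC-JP, the row and cell are simply each byte minus 0xA0, so the gap quoted above is easy to check by hand. A minimal sketch, with a made-up function name (eucjp_to_kuten) and assuming a two-byte code with both bytes in 0xA1..0xFE:

```shell
# Map a two-byte EUC-JP code to "row-cell" in the 94x94 grid.
# Hypothetical helper; single-byte and 0x8E/0x8F-prefixed codes are not handled.
eucjp_to_kuten() {
  row=$(( ($1 >> 8) - 0xA0 ))
  cell=$(( ($1 & 0xFF) - 0xA0 ))
  printf '%d-%d\n' "$row" "$cell"
}

eucjp_to_kuten 0xA8C0   # prints 8-32
eucjp_to_kuten 0xADA1   # prints 13-1
eucjp_to_kuten 0xB0A1   # prints 16-1
```

Under these assumptions 0xADA1 lands in row 13, between the last code before the gap (row 8) and the first code after it (row 16).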
> $ xxd 2022.txt
> 0000000: 1b24 4247 5b3f 2e25 6a25 3925 484c 3e25 .$BG[?.%j%9%HL>%
> 0000010: 4625 3925 482d 211b 284a F%9%H-!.(J
ISO-2022-JP is not suitable for Japanese: It does not even contain
halfwidth Katakana characters. I don't know why you would want to use this
encoding. Nobody uses it.
An encoding similar to ISO-2022-JP that is still sometimes used for
email or web pages is ISO-2022-JP-2. libiconv supports it:
$ iconv -f UTF-8 -t ISO-2022-JP-2 < utf8.txt | hd
000000 1B 24 42 47 5B 3F 2E 25 6A 25 39 25 48 4C 3E 25 .$BG[?.%j%9%HL>%
000010 46 25 39 25 48 1B 24 41 22 59 1B 28 42 F%9%H.$A"Y.(B
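To read the hex dump above, it helps to name the escape sequences. The sketch below maps the bytes that follow ESC (0x1B) to the character set they designate, as I understand RFC 1468 and RFC 1554; the function name iso2022_designation is hypothetical, and the list covers only the G0 designations. Read this way, the output starts in JIS X 0208 (ESC $ B), switches to GB 2312 (ESC $ A) for the two bytes 22 59, and ends in ASCII (ESC ( B).

```shell
# Name the character set designated by an ISO-2022-JP-2 escape sequence.
# The argument is the text after the ESC byte, e.g. '$B' for 1B 24 42.
# Hypothetical helper; G2 designations (ESC . A, ESC . F) are left out.
iso2022_designation() {
  case "$1" in
    '(B')  echo 'ASCII' ;;
    '(J')  echo 'JIS X 0201 Roman' ;;
    '$@')  echo 'JIS X 0208-1978' ;;
    '$B')  echo 'JIS X 0208-1983' ;;
    '$A')  echo 'GB 2312-1980' ;;
    '$(C') echo 'KS C 5601-1987' ;;
    '$(D') echo 'JIS X 0212-1990' ;;
    *)     echo 'unknown' ;;
  esac
}

iso2022_designation '$B'   # prints JIS X 0208-1983
iso2022_designation '$A'   # prints GB 2312-1980
iso2022_designation '(B'   # prints ASCII
```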
Your 2022.txt is not valid in any known encoding:
$ printf '\x1b\x24\x42\x47\x5b\x3f\x2e\x25\x6a\x25\x39\x25\x48\x4c\x3e\x25\x46\x25\x39\x25\x48\x2d\x21\x1b\x28\x4a' > 2022.txt
$ iconv -f ISO-2022-JP -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:18: cannot convert
$ iconv -f ISO-2022-JP-1 -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:18: cannot convert
$ iconv -f ISO-2022-JP-2 -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:18: cannot convert
$ iconv -f ISO-2022-CN -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:0: cannot convert
$ iconv -f ISO-2022-CN-EXT -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:0: cannot convert
$ iconv -f ISO-2022-KR -t UTF-8 < 2022.txt > /dev/null
/arch/x86-linux/gnu-inst-libiconv/1.13/bin/iconv: (stdin):1:0: cannot convert
Bruno