Re: [bug-gnu-libiconv] iconv in terminal and cpp differs
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] iconv in terminal and cpp differs
Date: Wed, 04 Oct 2017 11:53:59 +0200
User-agent: KMail/5.1.3 (Linux/4.4.0-96-generic; KDE/5.18.0; x86_64; ; )
Hi,
To investigate these issues, it is useful to display the byte sequences in
hexadecimal form. You could use 'od -t x1' to do this; I prefer 'hd',
implemented as
=====================================================================
#!/bin/sh
hexdump -e '"%06.6_ax " 16/1 "%02X "' -e '" " 16/1 "%_p" "\n"' "$@"
=====================================================================
> When we run this as command line as shown below
>
> iconv -f gb18030 -t utf-8 GB18030.txt > utf-8.txt
>
> utf-8.txt has - 我爱北京天安门，天安门上太阳升
The original bytes that you input in this conversion were:
$ echo '我爱北京天安门，天安门上太阳升' | iconv -f UTF-8 -t GB18030 | hd
000000 CE D2 B0 AE B1 B1 BE A9 CC EC B0 B2 C3 C5 A3 AC ................
000010 CC EC B0 B2 C3 C5 C9 CF CC AB D1 F4 C9 FD 0A ...............
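The same bytes can be cross-checked without iconv. The following is a minimal sketch, assuming Python 3.8 or later and its built-in "gb18030" codec; the variable names are illustrative only. Note that the A3 AC pair in the dump is the fullwidth comma (U+FF0C), not the ASCII comma, and that no trailing newline (0A) appears because nothing like 'echo' adds one here.

```python
# Illustrative cross-check, not part of the original report.
# Encode the sentence as GB18030 and show the bytes in hexadecimal.
text = "我爱北京天安门，天安门上太阳升"  # fullwidth comma U+FF0C in the middle
gb_bytes = text.encode("gb18030")

# bytes.hex() with a separator argument requires Python 3.8+.
hex_dump = gb_bytes.hex(" ").upper()

print(len(gb_bytes), "bytes")
print(hex_dump)
```

Each of the 15 characters takes two bytes in GB18030, so 30 bytes are expected, matching the dump above minus the final 0A.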
> but when we run the attached source code’s binary by inputting like below
> in same terminal,
>
> [Encoder GB18030_String “FROM” “TO”]
> Encoder ÎÒ°®±±¾©Ìì°²ÃÅ£¬Ìì°²ÃÅÉÏÌ«ÑôÉý “GB18030” “UTF-8”
>
> We are getting
>
> 脦脪掳庐卤卤戮漏脤矛掳虏脙脜拢卢脤矛掳虏脙脜脡脧脤芦脩么脡媒
The bytes that you gave as input in this conversion were:
$ echo '脦脪掳庐卤卤戮漏脤矛掳虏脙脜拢卢脤矛掳虏脙脜脡脧脤芦脩么脡媒' | iconv -f UTF-8 -t GB18030 | hd
000000 C3 8E C3 92 C2 B0 C2 AE C2 B1 C2 B1 C2 BE C2 A9 ................
000010 C3 8C C3 AC C2 B0 C2 B2 C3 83 C3 85 C2 A3 C2 AC ................
000020 C3 8C C3 AC C2 B0 C2 B2 C3 83 C3 85 C3 89 C3 8F ................
000030 C3 8C C2 AB C3 91 C3 B4 C3 89 C3 BD 0A .............
As you can see, there are about twice as many bytes here as in the original.
More precisely, the input you gave here is UTF-8 encoded. Look at this:
$ echo '脦脪掳庐卤卤戮漏脤矛掳虏脙脜拢卢脤矛掳虏脙脜脡脧脤芦脩么脡媒' | iconv -f UTF-8 -t GB18030 \
    | iconv -f UTF-8 -t ISO-8859-1 | hd
000000 CE D2 B0 AE B1 B1 BE A9 CC EC B0 B2 C3 C5 A3 AC ................
000010 CC EC B0 B2 C3 C5 C9 CF CC AB D1 F4 C9 FD 0A ...............
Here we find your original input again!
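The two-step recovery can also be replayed in a scripting language. This is a minimal sketch, assuming Python 3 with its built-in "gb18030", "utf-8" and "iso-8859-1" codecs; the names step1, step2 and recovered are illustrative. It mirrors the two iconv invocations above and then decodes the result as GB18030 to show the intended text.

```python
# Illustrative replay of the recovery pipeline, not part of the original report.
mojibake = "脦脪掳庐卤卤戮漏脤矛掳虏脙脜拢卢脤矛掳虏脙脜脡脧脤芦脩么脡媒"

# Step 1: terminal text to GB18030 bytes (first iconv call).
step1 = mojibake.encode("gb18030")

# Step 2: read those bytes as UTF-8 and write them as ISO-8859-1
# (second iconv call). This undoes the unwanted Latin-1 to UTF-8 conversion.
step2 = step1.decode("utf-8").encode("iso-8859-1")

# step2 now holds the original GB18030 bytes; decode them to see the text.
recovered = step2.decode("gb18030")

print(step2.hex(" ").upper())
print(recovered)
```

If any step raised a UnicodeError, that would indicate the garbling did not follow the ISO-8859-1 to UTF-8 path assumed here.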
So, there was an undesired conversion from ISO-8859-1 to UTF-8 on your input.
I would guess that you are on Linux, and this conversion happened when you did a
copy&paste of the snippet, from a file into a terminal window.
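The suspected chain of events can be simulated in the forward direction as well. The sketch below assumes Python 3 and its built-in codecs; it is only a model of what the terminal and the program presumably did, not a trace of the actual session, and all variable names are illustrative.

```python
# Illustrative forward simulation of the suspected double encoding.
original = "我爱北京天安门，天安门上太阳升"

# The file GB18030.txt contains these bytes.
gb_bytes = original.encode("gb18030")

# Viewed as ISO-8859-1, the bytes look like the text pasted in the terminal.
as_latin1 = gb_bytes.decode("iso-8859-1")

# Assumption: the terminal passes the pasted characters on as UTF-8,
# so the program receives roughly twice as many bytes.
utf8_bytes = as_latin1.encode("utf-8")

# The program then treats those UTF-8 bytes as GB18030 and converts them.
garbled = utf8_bytes.decode("gb18030")

print(as_latin1)
print(len(gb_bytes), "->", len(utf8_bytes), "bytes")
print(garbled)
```

Under these assumptions the simulated output matches the garbled string reported above, which supports the copy-and-paste explanation without proving it.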
Bruno