From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] Generating utf-16 with BOM and specified endianness?
Date: Tue, 24 Jan 2012 03:46:38 +0100
User-agent: KMail/4.7.4 (Linux/3.1.0-1.2-desktop; KDE/4.7.4; x86_64; ; )
Hi,
Keith Thompson wrote:
> The "iconv" command supports "utf-16", "utf-16be", and "utf-16le"
> formats (among many many others).
>
> For output (e.g., "echo hello | iconv -f ascii -t utf-16be"), the
> utf-16be format is big-endian UTF-16 with no BOM (byte order mark);
> similarly, utf-16le is little-endian UTF-16 with no BOM.
>
> The "-t utf-16" option causes iconv to generate UTF-16 output *with*
> a BOM -- but the endianness is unspecified.
Yes, this is as specified in RFC 2781.
> A few experiments seem
> to indicate that the generated UTF-16 uses the same endianness as
> the current system
Yes, this is the case for glibc's iconv. On an x86 system:
$ echo abc | /usr/bin/iconv -t utf-16 | od -t x1
0000000 ff fe 61 00 62 00 63 00 0a 00
0000012
On a PowerPC system:
$ echo abc | /usr/bin/iconv -t utf-16 | od -t x1
0000000 fe ff 00 61 00 62 00 63 00 0a
0000012
> but I've seen one report (which I'm trying to
> verify) of it generating big-endian output on a little-endian system
> (x86 Mac OSX).
Yes, iconv on MacOS X is derived from GNU libiconv, and its conversion
to UTF-16 happens to always produce big-endian output.
> There doesn't seem to be a way to tell iconv to generate UTF-16
> output with a BOM with a specified endianness.
Correct. It doesn't matter: if the receiver/decoder pays attention to
the BOM, then it can cope with either byte order.
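To illustrate the point about BOM-aware decoders, here is a minimal sketch, assuming a POSIX shell and an iconv (glibc or GNU libiconv) whose "utf-16" input decoder honors the BOM. The function name decode_utf16 is hypothetical. Both inputs encode the same two characters, "ab", once in each byte order, and both should decode to the same text:

```shell
# Decode UTF-16 whose byte order is announced by a leading BOM.
# decode_utf16 is a hypothetical helper name, not part of iconv.
decode_utf16() {
    iconv -f utf-16 -t utf-8
}

# Little-endian input: BOM FF FE, then 'a' and 'b' as 61 00 62 00.
printf '\377\376a\000b\000' | decode_utf16
echo

# Big-endian input: BOM FE FF, then 'a' and 'b' as 00 61 00 62.
printf '\376\377\000a\000b' | decode_utf16
echo
```

Octal escapes are used in printf because hexadecimal escapes such as '\xff' are not guaranteed by POSIX.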
> (For example, the preferred format for Unicode text on MS Windows
> is little-endian UTF-16 with BOM).
That was the case about 10 years ago. Nowadays the preferred format for
Unicode text on Windows is UTF-8.
> There are workarounds, such as prepending the BOM manually
Yes, and it is simple enough, no?
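A minimal sketch of that workaround, assuming a POSIX shell and an iconv that accepts the utf-16le and utf-16be names discussed above. The function name to_utf16_bom and its argument convention are hypothetical. It writes the two BOM bytes itself and then appends the BOM-less conversion:

```shell
# Emit UTF-16 with a BOM in a chosen byte order.
# Usage: to_utf16_bom le|be [from-encoding]   (hypothetical helper)
to_utf16_bom() {
    case $1 in
        le) printf '\377\376' ;;   # U+FEFF serialized little-endian: FF FE
        be) printf '\376\377' ;;   # U+FEFF serialized big-endian: FE FF
        *)  echo "usage: to_utf16_bom le|be [from-encoding]" >&2
            return 2 ;;
    esac
    # utf-16le / utf-16be output carries no BOM of its own,
    # so the bytes written above are the only BOM in the stream.
    iconv -f "${2:-ascii}" -t "utf-16$1"
}

echo hello | to_utf16_bom le | od -t x1
```

Note that this sketch emits a BOM even when the input is empty; whether that is desirable depends on the consumer.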
> it would be nice to be able to just specify the format directly.
> Suggested syntax:
>
> iconv -f ... -t utf-16bebom
> iconv -f ... -t utf-16lebom
No, that would not be "nice". Introducing a new encoding or a new name
for an existing encoding is a BAD IDEA. The effect would be that only some
conversion software would understand your new names "utf-16bebom" and
"utf-16lebom". The resulting interoperability problems would exert pressure
on the other software to support these encoding names as well. For
several years, until all the software authors had extended their
converters, people would face problems. So much trouble, for extremely
little gain!
> If there's interest in this feature, but insufficient time to
> implement it, I'd consider implementing it myself and submitting
> a patch.
It wouldn't be a feature, when considering the big picture.
> Another question: There seem to be two separate "iconv" commands.
> /usr/bin/iconv on my Ubuntu 11.04 system is part of the libc-bin
> package, which seems to be distinct from, but similar to, the
> version provided by the libiconv package. What is the relationship
> between them?
GNU libc and GNU libiconv are different packages, each implementing
the iconv() C API and the 'iconv' command. GNU libiconv is not meant
to be used on glibc systems, because the iconv provided by glibc is
entirely sufficient.
Both have very similar encoding names (aliases) and encoding tables,
and share the same test suite. But the code is different, focusing on
extensibility and speed in the GNU libc case and on portability and size
in the GNU libiconv case.
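To check which of the two implementations a given 'iconv' command comes from, one option is to look at its version banner. The following is a rough sketch, assuming that the command supports --version and that the first line of the banner mentions either "libiconv" or the C library; the exact banner text varies between distributions, and the function name iconv_flavor is hypothetical:

```shell
# Guess which package provides the iconv command found on PATH.
# iconv_flavor is a hypothetical helper; the banner patterns are assumptions.
iconv_flavor() {
    case $(iconv --version 2>/dev/null | head -n 1) in
        *libiconv*)                     echo libiconv ;;
        *libc*|*LIBC*|*glibc*|*GLIBC*)  echo glibc ;;
        *)                              echo unknown ;;
    esac
}

iconv_flavor
```

An "unknown" result only means the banner matched neither pattern, for example on systems that ship a third implementation.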
> Would changes made to one of them show up (eventually) in the other?
No, normally not. You need to report bugs or change requests separately.
Bruno