Re: [bug-gnu-libiconv] iconv issue
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] iconv issue
Date: Sat, 01 Oct 2016 17:23:45 +0200
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )
Hi,
Kenneth Nellis wrote on 2016-06-10:
> $ file f
> f: exported SGML document, UTF-8 Unicode (with BOM) text, with CRLF line
> terminators
> ...
> Accordingly, it seems strange, perhaps a bug?, that the former of the
> following two lines doesn't work, but the latter does:
>
> $ cat f | iconv -f UTF-8 -t Latin1 > x
> iconv: (stdin):1:0: cannot convert
> $ cat f | iconv -f UTF-8 -t UTF-16 | iconv -f UTF-16 -t Latin1 > x
> $
The output of the 'file f' command shows that the contents of f start with a
U+FEFF character. According to RFC 3629 [1] section 6:
"It is therefore RECOMMENDED to avoid stripping an initial
U+FEFF interpreted as a signature without a good reason, to ignore it
instead of stripping it when appropriate (such as for display) and to
strip it only when really necessary."
It is therefore OK that iconv does not strip away the leading U+FEFF character.
Since Latin1 has no representation for U+FEFF, the first command fails with
"cannot convert".
The second line succeeds because the 'iconv -f UTF-8 -t UTF-16' command
leaves the U+FEFF character in place and the 'iconv -f UTF-16 ...' command
then strips it away. This is because in UTF-16 a leading U+FEFF is the
byte-order mark, which the UTF-16 decoder consumes.
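
For illustration only, here is a minimal Python sketch of the same effect. It
uses Python's built-in codecs, not iconv, so it mirrors the behaviour rather
than reproducing libiconv. The helper name utf8_to_latin1 and the sample bytes
are made up for this example. The assumption is that dropping the signature is
acceptable for the file at hand, i.e. one of the "really necessary" cases.

```python
import codecs


def utf8_to_latin1(data: bytes, strip_bom: bool = False) -> bytes:
    """Convert UTF-8 bytes to Latin-1 (hypothetical helper for this sketch)."""
    # The "utf-8" codec keeps a leading U+FEFF as an ordinary character;
    # "utf-8-sig" treats it as a signature and drops it while decoding.
    text = data.decode("utf-8-sig" if strip_bom else "utf-8")
    # Latin-1 covers only U+0000..U+00FF, so a surviving U+FEFF cannot be encoded.
    return text.encode("latin-1")


# Made-up sample: UTF-8 signature, some text, CRLF line terminator.
sample = codecs.BOM_UTF8 + "caf\u00e9\r\n".encode("utf-8")

try:
    utf8_to_latin1(sample)
except UnicodeEncodeError as exc:
    # Same situation as "iconv -f UTF-8 -t Latin1": U+FEFF has no Latin-1 form.
    print("cannot convert:", exc.reason)

# With the signature removed explicitly, the conversion goes through.
print(utf8_to_latin1(sample, strip_bom=True))
```

The point of the sketch is that the conversion fails only because of the one
leading character. Once something removes it on purpose, the rest of the text
converts without a detour through UTF-16.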
Yes, I know such BOMs frequently occur in XML files written by Windows tools,
because some Windows developers have/had the mindset that a BOM was a good
thing, when in fact it is a bad thing (in the case of UTF-8).
Bruno
[1] https://tools.ietf.org/html/rfc3629