Re: [bug-gnu-libiconv] iconv not catching bad bytes for ISO-8859-1
From: Bruno Haible
Subject: Re: [bug-gnu-libiconv] iconv not catching bad bytes for ISO-8859-1
Date: Fri, 14 Aug 2015 12:28:14 +0200
User-agent: KMail/4.8.5 (Linux/3.8.0-44-generic; KDE/4.8.5; x86_64; ; )
Hi,
Kenneth Reid Beesley wrote on 13.08.2015:
> Problem: iconv not catching/detecting bad bytes when converting from a file
> alleged to be ISO-8859-1 (but it’s not)
>
> Dear All,
>
> I’m using iconv (GNU libiconv 1.14), written by Bruno Haible, in a SUSE Linux
> system.
> Also iconv (GNU libiconv 1.11) on a separate machine (OS X 10.10.4).
>
> 1. I create a file, input1252.txt, that contains hex byte values x91 and
> x92. This file is encoded in CP1252,
> where x91 and x92 are legal/defined bytes.
>
> These two bytes are not defined in ISO-8859-1
>
> 2. I run the following script
>
> iconv -f ISO-8859-1 -t UTF-8 --byte-subst="<PROBLEM: 0x%x>"
> --unicode-subst="<PROBLEM: U+%04X>" input1252.txt > out.txt
>
> i.e. telling iconv (incorrectly) that the input file is Latin 1, and telling
> it to convert it
> to UTF-8. I expect the x91 and x92 bytes to be recognized as
> not-legal-in-Latin1,
> and I expect to see <PROBLEM: 0x91> and <PROBLEM: 0x92> in the out.txt file.
Your expectation is ill-founded. ISO-8859-1 has no unassigned code points.
That is, all 256 byte values are valid.
Witness:
1) Wikipedia https://en.wikipedia.org/wiki/ISO/IEC_8859-1 says
"In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the extra
hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet.
This map assigns the C0 and C1 control characters to the unassigned code
values thus provides for 256 characters via every possible 8-bit value."
2) When you go to
http://www.haible.de/bruno/charsets/conversion-tables/index.html
-> ISO-8859-* -> ISO-8859-1, you can see that all charset converters
from different vendors implement the ISO-8859-1 <--> Unicode conversion
in the same way.
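
The same behaviour can be observed outside of iconv. The following is a
minimal sketch in Python, assuming that Python's built-in "iso-8859-1" and
"cp1252" codecs follow the mappings described above; the variable names are
only illustrative. It shows that every byte value decodes under ISO-8859-1,
and that 0x91 and 0x92 simply become the control characters U+0091 and
U+0092 rather than being rejected.

```python
# Every one of the 256 byte values decodes under ISO-8859-1,
# and byte N maps to the Unicode code point U+00NN.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")
code_points = [ord(c) for c in text]

# The two bytes from the report, wrapped around some ASCII text.
sample = b"\x91quoted\x92"

# Decoded as ISO-8859-1: no error, the bytes become C1 control characters.
as_latin1 = sample.decode("iso-8859-1")

# Decoded as CP1252: the same bytes become curly quotation marks.
as_cp1252 = sample.decode("cp1252")

print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])
print([hex(ord(c)) for c in (as_cp1252[0], as_cp1252[-1])])
```

Neither decoding reports an error; only the resulting characters differ.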
Probably you know that the byte values 0x7F..0x9F in ISO-8859-1 don't
correspond to *graphic* characters in ISO-8859-1 (while some of them
correspond to graphic characters in Windows-1252). But iconv is the
wrong tool to make a distinction between graphic and non-graphic characters.
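
If the goal is to find such bytes, a separate check is needed. The sketch
below is one possible approach in Python, not something iconv provides: the
function name find_control_bytes and the set ALLOWED_CONTROLS are
hypothetical, and the choice to tolerate tab, line feed and carriage return
is an assumption about what counts as ordinary text. It reports every byte
whose ISO-8859-1 interpretation is a control character (Unicode category
"Cc"), which covers 0x00..0x1F and 0x7F..0x9F.

```python
import unicodedata

# Assumption: tab, LF and CR are normal in text files and not worth flagging.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def find_control_bytes(data: bytes):
    """Return (offset, byte) pairs for bytes that are control characters
    when the data is interpreted as ISO-8859-1."""
    hits = []
    for offset, byte in enumerate(data):
        if byte in ALLOWED_CONTROLS:
            continue
        # In ISO-8859-1, byte N is the Unicode code point U+00NN.
        if unicodedata.category(chr(byte)) == "Cc":
            hits.append((offset, byte))
    return hits


for offset, byte in find_control_bytes(b"a\x91b\x92\n"):
    print(f"offset {offset}: 0x{byte:02X}")
```

A file that is really CP1252 but labelled as ISO-8859-1 will typically show
hits in the 0x80..0x9F range, which is a hint, not a proof, of mislabelling.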
Bruno