
Re: [Lynx-dev] Dumps Unicode file in broken encoding.


From: Thomas Dickey
Subject: Re: [Lynx-dev] Dumps Unicode file in broken encoding.
Date: Mon, 29 Sep 2008 07:02:14 -0400 (EDT)

On Mon, 29 Sep 2008, Atsuhito Kohda wrote:

Hi all,

I got the following bug report in the Debian BTS (Bug#498985).
As I have no knowledge of this, I'd like to forward the report
to this list.

This seems to be what I discussed with Plessy week-before-last:

        I explained the expected behavior.
        lynx seems to be matching that.
        He replied that it did not do that before.
        I verified that it worked as expected in 2.8.5.

Without some charset in the document, or override via command-line or lynx configuration, the file will be treated as ISO-8859-1. He seems to be expecting lynx to treat it as UTF-8.

(without some more details, I don't know where to look).
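
To illustrate the point above: the two non-ASCII items in test.html are the bytes c3 a9 and c3 a0 (taken from the hexdump later in this message). The following is only a small Python sketch of how those same bytes read under each charset assumption; it is not code that lynx runs, and the variable names are made up for the example.

```python
# Bytes of the two list items in test.html, copied from the hexdump.
raw = bytes.fromhex("c3a9 c3a0")

# Reading the bytes as UTF-8 gives the two characters the reporter expects.
as_utf8 = raw.decode("utf-8")

# Reading them as ISO-8859-1 gives four characters, one per byte.
as_latin1 = raw.decode("iso-8859-1")

print(len(as_utf8), [hex(ord(c)) for c in as_utf8])      # 2 ['0xe9', '0xe0']
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])  # 4 ['0xc3', '0xa9', '0xc3', '0xa0']
```

Under the ISO-8859-1 reading, the byte a0 is a no-break space. Whether that has anything to do with the byte missing from the dump is not established in this thread.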


On Mon, 15 Sep 2008 16:10:38 +0900, Charles Plessy wrote:

I have severe problems when converting HTML messages with Lynx while
using Mutt, and it seems to me that the reason is that the output
encoding is broken. Here is a simple example:

aqwa???~???$ cat test.html
<ul>
<li>é</li>
<li>à</li>
</ul>

aqwa???~???$ hexdump -C test.html
00000000  3c 75 6c 3e 0a 3c 6c 69  3e c3 a9 3c 2f 6c 69 3e  |<ul>.<li>..</li>|
00000010  0a 3c 6c 69 3e c3 a0 3c  2f 6c 69 3e 0a 3c 2f 75  |.<li>..</li>.</u|
00000020  6c 3e 0a                                          |l>.|
00000023

aqwa???~???$ lynx.cur --dump test.html
     * é
     *


aqwa???~???$ lynx.cur --dump test.html > test.txt

aqwa???~???$ hexdump -C test.txt
00000000  20 20 20 20 20 2a 20 c3  a9 0a 20 20 20 20 20 2a  |     * ...     *|
00000010  20 c3 0a 0a                                       | ...|
00000014

Here are the expected files in latin and unicode encodings:

aqwa???~???$ cat test.unicode.txt
     * é
     * à


aqwa???~???$ hexdump -C test.unicode.txt
00000000  20 20 20 20 20 2a 20 c3  a9 0a 20 20 20 20 20 2a  |     * ...     *|
00000010  20 c3 a0 0a 0a                                    | ....|
00000015

aqwa???~???$ cat test.iso.txt
     *
     *


aqwa???~???$ hexdump -C test.iso.txt
00000000  20 20 20 20 20 2a 20 e9  0a 20 20 20 20 20 2a 20  |     * ..     * |
00000010  e0 0a 0a                                          |...|
00000013

So apparently, "à" is C3A0 in UTF-8, E0 in ISO 8859-1, but Lynx dumps it as
C3. This causes encoding misdetection, and many downstream problems.
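
The byte values quoted above can be checked mechanically. The sketch below is in Python and uses only the bytes from the hexdumps in this report; the helper name first_invalid_utf8_offset is hypothetical. It confirms the two expected encodings of "à" and locates the point where the dumped file stops being valid UTF-8.

```python
# Bytes copied from the three hexdumps in this report.
dumped = bytes.fromhex(
    "20 20 20 20 20 2a 20 c3 a9 0a 20 20 20 20 20 2a 20 c3 0a 0a")
expected_utf8 = bytes.fromhex(
    "20 20 20 20 20 2a 20 c3 a9 0a 20 20 20 20 20 2a 20 c3 a0 0a 0a")
expected_latin1 = bytes.fromhex(
    "20 20 20 20 20 2a 20 e9 0a 20 20 20 20 20 2a 20 e0 0a 0a")


def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# The two encodings of "à" named in the report.
assert "\u00e0".encode("utf-8") == b"\xc3\xa0"
assert "\u00e0".encode("iso-8859-1") == b"\xe0"

print(first_invalid_utf8_offset(expected_utf8))    # None: valid UTF-8
print(first_invalid_utf8_offset(dumped))           # 17: lone c3 before the newline
print(first_invalid_utf8_offset(expected_latin1))  # 7: e9 is not valid UTF-8 here
```

The dumped file is valid in neither sense: it starts out as UTF-8 (c3 a9) and then breaks at offset 0x11, which is consistent with the misdetection described above.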

Thanks in advance.

Regards,                        2008-9-29(Mon)

--
Debian Developer - much more I18N of Debian
Atsuhito Kohda <kohda AT debian.org>
Department of Math., Univ. of Tokushima


--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net
