[Lynx-dev] fix for decoding utf-8 in CDATA sections
From: Hiltjo Posthuma
Subject: [Lynx-dev] fix for decoding utf-8 in CDATA sections
Date: Thu, 27 Jul 2023 22:25:13 +0200
Hi,
I use lynx to convert HTML to plain text, but noticed that part of the
output goes missing when a CDATA section contains UTF-8 text.
Below is a small test-case to reproduce it:
<p>Works correctly:</p>
<p>a’b</p>
<p>Doesn't work correctly:</p>
<p><![CDATA[a’b]]></p>
The UTF-8 byte sequence for the codepoint (U+2019) is: printf '\342\200\231'
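As a sanity check on that byte sequence, here is a minimal sketch in C that
decodes one UTF-8 sequence into a codepoint. The helper name utf8_decode is
hypothetical and not part of lynx; the sketch only shows that the octal bytes
342 200 231 form a single three-byte character.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the UTF-8 sequence starting at s (len bytes available) into *out.
 * Returns the number of bytes consumed, or 0 if the input is malformed or
 * truncated. Sketch only: overlong forms and surrogates are not rejected.
 * utf8_decode is a hypothetical helper, not a lynx function.
 */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t need, i;
    uint32_t cp;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                  /* 1 byte: plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) { /* 2-byte lead: 110xxxxx */
        need = 2;
        cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 3-byte lead: 1110xxxx */
        need = 3;
        cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 4-byte lead: 11110xxx */
        need = 4;
        cp = s[0] & 0x07;
    } else {
        return 0;                       /* stray continuation or invalid lead */
    }
    if (len < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)      /* continuation must be 10xxxxxx */
            return 0;
        cp = (cp << 6) | (uint32_t) (s[i] & 0x3F);
    }
    *out = cp;
    return need;
}
```

Called on the three bytes above, it should consume all three and yield
U+2019, so a consumer that forwards only the lead byte 0xE2 loses the
character.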
I use the following command to convert HTML to text:
lynx -stdin -dump \
-underline_links -image_links \
-display_charset="utf-8" -assume_charset="utf-8"
My system information:
I tested on the latest lynx-cur: lynx2.9.0dev.12
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
What I found:
I think lynx only emits the first byte of the multi-byte sequence instead of
the decoded codepoint (clong). In the file WWW/Library/Implementation/SGML.c
there is a similar case for comments, for example at "S_comment_put_c:".
Below is a patch. I'm not sure it covers all lynx options, but I hope it
does:
diff --git a/WWW/Library/Implementation/SGML.c b/WWW/Library/Implementation/SGML.c
index 2534606..8632670 100644
--- a/WWW/Library/Implementation/SGML.c
+++ b/WWW/Library/Implementation/SGML.c
@@ -3502,9 +3502,13 @@ static void SGML_character(HTStream *me, int c_in)
me->state = S_text;
break;
}
- HTChunkPutc(string, c);
- break;
+ if (me->T.decode_utf8) {
+ HTChunkPutUtf8Char(string, clong);
+ } else {
+ HTChunkPutc(string, c);
+ }
+ break;
case S_sgmlent: /* Expecting ENTITY. - FM */
if (!me->first_dash && c == '-') {
HTChunkPutc(string, c);
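
To illustrate the branch the patch adds, here is a standalone sketch in C,
assuming the diagnosis above is right. It does not use lynx's HTChunk API:
struct buf, buf_putc, buf_put_utf8 and append_char are hypothetical stand-ins.
When decoding is on, the whole codepoint is re-encoded into the buffer;
otherwise the single byte is appended as before.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size buffer; a stand-in for lynx's HTChunk. */
struct buf {
    unsigned char data[16];
    size_t len;
};

/* Append one byte, silently dropping it if the buffer is full. */
static void buf_putc(struct buf *b, unsigned char c)
{
    if (b->len < sizeof(b->data))
        b->data[b->len++] = c;
}

/* Append codepoint cp encoded as UTF-8 (1 to 4 bytes). */
static void buf_put_utf8(struct buf *b, uint32_t cp)
{
    if (cp < 0x80) {
        buf_putc(b, (unsigned char) cp);
    } else if (cp < 0x800) {
        buf_putc(b, (unsigned char) (0xC0 | (cp >> 6)));
        buf_putc(b, (unsigned char) (0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        buf_putc(b, (unsigned char) (0xE0 | (cp >> 12)));
        buf_putc(b, (unsigned char) (0x80 | ((cp >> 6) & 0x3F)));
        buf_putc(b, (unsigned char) (0x80 | (cp & 0x3F)));
    } else {
        buf_putc(b, (unsigned char) (0xF0 | (cp >> 18)));
        buf_putc(b, (unsigned char) (0x80 | ((cp >> 12) & 0x3F)));
        buf_putc(b, (unsigned char) (0x80 | ((cp >> 6) & 0x3F)));
        buf_putc(b, (unsigned char) (0x80 | (cp & 0x3F)));
    }
}

/*
 * Mirrors the shape of the patched branch: with decoding enabled, store
 * the full codepoint (clong); otherwise store the raw byte (c).
 */
static void append_char(struct buf *b, int decode_utf8,
                        uint32_t clong, unsigned char c)
{
    if (decode_utf8)
        buf_put_utf8(b, clong);
    else
        buf_putc(b, c);
}
```

With clong = 0x2019 and c = 0xE2, the decoding path stores three bytes
(E2 80 99) while the byte-only path stores just E2, which matches the
truncated output seen in the CDATA test case.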
Thank you for lynx,
--
Kind regards,
Hiltjo