Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections
From: Hiltjo Posthuma
Subject: Re: [Lynx-dev] fix for decoding utf-8 in CDATA sections
Date: Tue, 3 Oct 2023 23:29:07 +0200
On Thu, Jul 27, 2023 at 10:25:13PM +0200, Hiltjo Posthuma wrote:
> Hi,
>
> I use lynx to convert HTML to plain-text, but noticed an issue where part of
> the output is missing with UTF-8 in CDATA sections.
>
> Below is a small test-case to reproduce it:
>
> <p>Works correctly:</p>
> <p>a’b</p>
>
> <p>Doesn't work correctly:</p>
> <p><![CDATA[a’b]]></p>
>
> The UTF-8 byte sequence for this codepoint (U+2019) is: printf '\342\200\231'
>
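As a quick check of the byte sequence quoted above, here is a minimal sketch in C that decodes a three-byte UTF-8 sequence into its code point. The helper name decode_utf8_3 is hypothetical and is not part of lynx; it only handles the three-byte form and does not reject overlong or surrogate encodings.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not lynx code): decode one 3-byte UTF-8 sequence.
 * Returns the code point, or 0 if the bytes do not have the
 * 1110xxxx 10xxxxxx 10xxxxxx shape. Overlong forms and surrogates
 * are not checked in this sketch.
 */
static unsigned long decode_utf8_3(const unsigned char *s)
{
    assert(s != NULL);

    if ((s[0] & 0xF0) != 0xE0 ||
        (s[1] & 0xC0) != 0x80 ||
        (s[2] & 0xC0) != 0x80)
        return 0;

    /* 4 payload bits from the lead byte, 6 from each continuation byte. */
    return ((unsigned long)(s[0] & 0x0F) << 12) |
           ((unsigned long)(s[1] & 0x3F) << 6) |
            (unsigned long)(s[2] & 0x3F);
}
```

Feeding it the octal bytes 0342, 0200, 0231 from the printf above should give 0x2019, the right single quotation mark used in the test case.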
>
> I use the following command to convert HTML to text:
>
> lynx -stdin -dump \
> -underline_links -image_links \
> -display_charset="utf-8" -assume_charset="utf-8"
>
>
> My system information:
> I tested on the latest lynx-cur: lynx2.9.0dev.12
>
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=en_US.UTF-8
>
>
> What I found:
>
> I think it only prints the first byte instead of the whole processed
> codepoint (clong). In WWW/Library/Implementation/SGML.c there is a similar
> case for comments, for example at "S_comment_put_c:".
>
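To illustrate the diagnosis above, here is a minimal sketch in C of what writing a whole code point involves compared with writing a single byte. The function encode_utf8 is a hypothetical stand-in written for this example; it is not lynx's HTChunkPutUtf8Char and makes no claim about how that function is implemented.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not lynx code): encode one code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes written,
 * or 0 if cp is above U+10FFFF. Surrogates are not rejected in this sketch.
 */
static size_t encode_utf8(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80UL) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                     /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {                   /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000UL) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

For U+2019 this produces three bytes. A code path that stores only one byte per decoded character cannot reproduce that sequence, which would be consistent with the missing output in the CDATA test case.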
> Below is a patch. I'm not sure it covers all lynx options though. I hope it
> does:
>
>
> diff --git a/WWW/Library/Implementation/SGML.c b/WWW/Library/Implementation/SGML.c
> index 2534606..8632670 100644
> --- a/WWW/Library/Implementation/SGML.c
> +++ b/WWW/Library/Implementation/SGML.c
> @@ -3502,9 +3502,13 @@ static void SGML_character(HTStream *me, int c_in)
> me->state = S_text;
> break;
> }
> - HTChunkPutc(string, c);
> - break;
>
> + if (me->T.decode_utf8) {
> + HTChunkPutUtf8Char(string, clong);
> + } else {
> + HTChunkPutc(string, c);
> + }
> + break;
> case S_sgmlent: /* Expecting ENTITY. - FM */
> if (!me->first_dash && c == '-') {
> HTChunkPutc(string, c);
>
>
> Thank you for lynx,
>
> --
> Kind regards,
> Hiltjo
>
Hi,
Is there any update on the status or review of this patch?
Thank you,
--
Kind regards,
Hiltjo