bug-wget

Re: [PATCH] Support for <meta charset=...> tag


From: Tim Rühsen
Subject: Re: [PATCH] Support for <meta charset=...> tag
Date: Sun, 26 Jul 2020 13:52:34 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

Ah, sorry, just saw this email with your patch :-)

Could you resend your patch as an attachment? Git can't am/apply the
inline version here.

Regards, Tim

On 15.07.20 12:55, Sho Amano wrote:
> Hi! I've been using the first version of wget for a long time and, first of
> all, I want to say thank you to all of the maintainers and contributors of
> this project!
> 
> I was looking at the code recently and found that it doesn't support
> the "<meta charset=...>" tag yet.
> I didn't see any issues in the bug tracker related to this, so I created
> a patch. I hope it helps.
> 
> I have also attached two HTML files for verification. One of them specifies
> a Japanese path in UTF-8, the other in Shift-JIS. Serve these files on
> localhost:8080 and let wget follow the link (e.g.
> `wget -d --recursive --level=2 http://localhost:8080/charset_test_shift_jis.html`).
> Verify that in both cases wget tries to download
> http://localhost:8080/%E6%97%A5%E6%9C%AC%E8%AA%9E.html.
> 
> Thanks!
> Sho Amano
> 
> ---
>  src/html-url.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/src/html-url.c b/src/html-url.c
> index b80cf269..5324d244 100644
> --- a/src/html-url.c
> +++ b/src/html-url.c
> @@ -182,6 +182,7 @@ static const char *additional_attributes[] = {
>    "http-equiv",                 /* used by tag_handle_meta  */
>    "name",                       /* used by tag_handle_meta  */
>    "content",                    /* used by tag_handle_meta  */
> +  "charset",                    /* used by tag_handle_meta  */
>    "action",                     /* used by tag_handle_form  */
>    "style",                      /* used by check_style_attr */
>    "srcset",                     /* used by tag_handle_img */
> @@ -191,7 +192,7 @@ static struct hash_table *interesting_tags;
>  static struct hash_table *interesting_attributes;
> 
>  /* Will contains the (last) charset found in 'http-equiv=content-type'
> -   meta tags  */
> +   or 'charset' meta tags  */
>  static char *meta_charset;
> 
>  static void
> @@ -574,6 +575,7 @@ tag_handle_meta (int tagid _GL_UNUSED, struct taginfo *tag, struct map_context *
>  {
>    char *name = find_attr (tag, "name", NULL);
>    char *http_equiv = find_attr (tag, "http-equiv", NULL);
> +  char *charset = find_attr (tag, "charset", NULL);
> 
>    if (http_equiv && 0 == c_strcasecmp (http_equiv, "refresh"))
>      {
> @@ -673,6 +675,20 @@ tag_handle_meta (int tagid _GL_UNUSED, struct taginfo *tag, struct map_context *
>              }
>          }
>      }
> +  else if (charset)
> +    {
> +      /* Handle stuff like:
> +         <meta charset="CHARSET">
> +         If charset is acquired from http-equiv then it is overwritten. */
> +
> +      /* Do a minimum check on the charset value */
> +      if (check_encoding_name (charset))
> +        {
> +          char *mcharset = xstrdup (charset);
> +          xfree (meta_charset);
> +          meta_charset = mcharset;
> +        }
> +    }
>  }
> 
>  /* Handle the IMG tag.  This requires special handling for the srcset attr,
> 
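
The new `else if (charset)` branch validates the attribute value and then swaps it into the stored charset. A minimal standalone sketch of that pattern follows. It is not wget code: `sketch_charset_name_ok` is a hypothetical stand-in for `check_encoding_name` (assumed here to reject empty names, whitespace, and non-ASCII bytes; the real check may differ), and `sketch_meta_charset` only mirrors the role of `meta_charset`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for wget's check_encoding_name(): accept only
   non-empty names made of printable, non-space ASCII characters.  */
bool
sketch_charset_name_ok (const char *name)
{
  assert (name != NULL);
  if (*name == '\0')
    return false;
  for (const unsigned char *p = (const unsigned char *) name; *p; p++)
    if (*p <= 0x20 || *p >= 0x7f)
      return false;
  return true;
}

/* Last accepted charset; plays the role of meta_charset in the patch.  */
char *sketch_meta_charset;

/* Store VALUE as the current charset if it passes the check.  The copy
   is made before the old value is freed, so a failed allocation or a
   rejected name leaves the previous charset untouched.  Returns true
   when the value was stored.  */
bool
sketch_set_meta_charset (const char *value)
{
  if (!sketch_charset_name_ok (value))
    return false;

  char *copy = malloc (strlen (value) + 1);
  if (copy == NULL)
    return false;
  strcpy (copy, value);

  free (sketch_meta_charset);
  sketch_meta_charset = copy;
  return true;
}
```

Because the last accepted value wins, a later `<meta charset=...>` replaces a charset taken earlier from an http-equiv tag, which matches the comment in the patch.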

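The URL in the verification step is the percent-encoded UTF-8 form of the file name 日本語.html. The sketch below shows where those escapes come from. It is an illustration only, not wget's own URL escaping: `sketch_percent_encode` is a hypothetical helper that escapes every byte outside the RFC 3986 unreserved set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Percent-encode IN into OUT (capacity OUT_SIZE, including the NUL).
   Every byte that is not an RFC 3986 unreserved character becomes %XX.
   Returns 0 on success, -1 if OUT is too small.  */
int
sketch_percent_encode (const char *in, char *out, size_t out_size)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t n = 0;

  assert (in != NULL && out != NULL);

  for (const unsigned char *p = (const unsigned char *) in; *p; p++)
    {
      int unreserved = (*p >= 'A' && *p <= 'Z') || (*p >= 'a' && *p <= 'z')
                       || (*p >= '0' && *p <= '9')
                       || *p == '-' || *p == '.' || *p == '_' || *p == '~';
      size_t need = unreserved ? 1 : 3;

      /* Keep one byte in reserve for the terminating NUL.  */
      if (n + need + 1 > out_size)
        return -1;

      if (unreserved)
        out[n++] = (char) *p;
      else
        {
          out[n++] = '%';
          out[n++] = hex[*p >> 4];
          out[n++] = hex[*p & 0x0F];
        }
    }

  if (n >= out_size)
    return -1;
  out[n] = '\0';
  return 0;
}
```

Passing the UTF-8 bytes `"\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E.html"` yields `%E6%97%A5%E6%9C%AC%E8%AA%9E.html`. For the Shift-JIS test page the link bytes differ, so the same URL is only reached if the page charset is detected and the link is converted to UTF-8 before escaping, which is the behaviour the test is meant to confirm.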