Re: [Bug-wget] [PATCH] improved Test-idn-robots.txt
From: Tim Rühsen
Subject: Re: [Bug-wget] [PATCH] improved Test-idn-robots.txt
Date: Wed, 09 Oct 2013 19:44:08 +0200
User-agent: KMail/4.10.5 (Linux/3.10-3-amd64; KDE/4.10.5; x86_64; ; )
On Tuesday, 8 October 2013, 15:07:51, Giuseppe Scrivano wrote:
> Tim Rühsen <address@hidden> writes:
> > I added two links/urls to follow in index.html, now there are three in
> > total. All three links/urls point to the same host, but have different
> > host encodings (plain international text, punycoding, percent escaping).
> >
> > Wget should recognize these three codings as being the same and thus I
> > removed the -H (host spanning) option to verify that.
> >
> > Now, Wget fails this test, I guess it needs a fix.
> >
> > Regards, Tim
> >
> > From 2e6f527121497b3b148496a9a9c774451d2e0017 Mon Sep 17 00:00:00 2001
> > From: Tim Ruehsen <address@hidden>
> > Date: Mon, 7 Oct 2013 23:37:42 +0200
> > Subject: [PATCH] improved Test-idn-robots.px
> >
> > ---
> >
> > tests/ChangeLog | 5 +++++
> > tests/Test-idn-robots.px | 27 ++++++++++++++++++++++++++-
> > 2 files changed, 31 insertions(+), 1 deletion(-)
>
> thanks for your test. The IRI support is a bit of a mess and I am not
> sure how this issue should be fixed:
>
> Should we check if the two domains are the same in recur.c (somewhere
> near line 633)? It means that we will need to check there for
> different encodings and convert among them. Another solution would be
> that append_url stores the url in a specific format.
>
> Probably the latter solution allows us to also deal with page specific
> locales when it is specified.
>
> Have you already looked into this issue? Do you have any
> idea/suggestion?
I already solved this issue in Mget, an experimental tool where I put the
URI/IRI parser into a library. I can offer to contribute code from that
source to Wget/FSF. Maybe you could take a look and see what fits Wget (since
Mget does the same job as Wget, it should fit).
The code for mget_iri_parse() is in
https://github.com/rockdaboot/mget/blob/master/libmget/iri.c
Mget 'normalizes' all URIs/IRIs by
- decoding the percent encoding
- encoding to UTF-8
- parsing into host/path/query etc.
- encoding the host with toASCII() (libidn2+libunistring or libidn) to ASCII form
via mget_str_to_ascii(iri->host)
From then on, this ASCII form is used as the host name for directories, DNS,
HTTP, comparisons, etc.
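As a rough illustration of why this makes the three links from the test compare
equal, here is a minimal Python sketch using only the standard library. It is
not Mget's C code: the function name normalize_host and the host bücher.example
are made up for the example, the sketch parses first and unescapes only the
host, and Python's built-in idna codec follows IDNA 2003, which may differ from
what libidn2 produces for some names.

```python
from urllib.parse import urlsplit, unquote


def normalize_host(url: str) -> str:
    """Return the ASCII (punycode) form of the host in `url`.

    Hypothetical helper for illustration only; not part of Wget or Mget.
    """
    # Parse the URL and take the host part (already lowercased by urlsplit).
    host = urlsplit(url).hostname or ""
    # Decode percent escapes, assuming the escaped bytes are UTF-8.
    host = unquote(host, encoding="utf-8")
    # Apply ToASCII per label via the built-in IDNA 2003 codec.
    return host.encode("idna").decode("ascii")


# The same (hypothetical) host in three encodings.
urls = [
    "http://bücher.example/robots.txt",          # plain international text
    "http://xn--bcher-kva.example/robots.txt",   # punycode
    "http://b%C3%BCcher.example/robots.txt",     # percent escaping
]

hosts = {normalize_host(u) for u in urls}
print(hosts)
```

Since all three inputs reduce to one ASCII host, a plain string comparison is
enough to decide whether a link stays on the same host, without -H.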
If I can lend you a hand, contact me.
Regards, Tim