[Bug-wget] escaped URLs and recursive retrieval

From:	Bram Vandoren
Subject:	[Bug-wget] escaped URLs and recursive retrieval
Date:	Mon, 21 Nov 2011 16:20:29 +0100
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110920 SUSE/3.1.15 Thunderbird/3.1.15

Hi,

I encountered a bug in wget that occurs with recursive retrieval when a page contains two (or more) links such as:

<a href="http://example.com/~user/blah"> and
<a href="http://example.com/%7Euser/blah">

Both links point to the same page, but the encoding is different. wget doesn't recognise this as the same page and downloads the page 'blah' twice. It also overwrites the first downloaded file. Also, if you specify the conversion option '-k', it only converts one of the two links.

I had a quick look at the source code. It can be solved by changing url_parse in url.c: call url_unescape before parsing the URL. This way you get the same parsed URL for both links. I am not sure if this is a good way to solve it. The conversion should probably be similar to the conversion that's done to determine the file name of the URL.
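
To illustrate the idea, here is a minimal standalone sketch, not a patch against wget. The function canonicalize_escapes and its helpers are hypothetical names that do not exist in url.c. It is also a more conservative variant than unescaping everything: it decodes only escapes of RFC 3986 "unreserved" characters (letters, digits, '-', '.', '_', '~') and upper-cases the hex digits of the rest, so that an escaped '/' or '?' keeps its meaning.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* RFC 3986 "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~". */
static int is_unreserved(int c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

/* Hypothetical helper (not part of wget): copy url into out, decoding
   %XX escapes of unreserved characters and upper-casing the hex digits
   of all other escapes. Returns 0 on success, -1 if out is too small. */
int canonicalize_escapes(const char *url, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    if (outsz == 0)
        return -1;

    while (*url) {
        int hi = -1, lo = -1;

        if (url[0] == '%' &&
            (hi = hex_value((unsigned char)url[1])) >= 0 &&
            (lo = hex_value((unsigned char)url[2])) >= 0) {
            int byte = hi * 16 + lo;

            if (is_unreserved(byte)) {
                /* Escape was unnecessary: emit the literal character. */
                if (o + 1 >= outsz) return -1;
                out[o++] = (char)byte;
            } else {
                /* Keep the escape, but in a single canonical spelling. */
                if (o + 3 >= outsz) return -1;
                out[o++] = '%';
                out[o++] = hex[hi];
                out[o++] = hex[lo];
            }
            url += 3;
        } else {
            /* Ordinary character, or a '%' not followed by two hex digits. */
            if (o + 1 >= outsz) return -1;
            out[o++] = *url++;
        }
    }
    out[o] = '\0';
    return 0;
}
```

With this sketch, both http://example.com/~user/blah and http://example.com/%7Euser/blah come out as http://example.com/~user/blah, so a duplicate check that compares the canonical strings would treat them as one page. Whether such a step belongs in url_parse or elsewhere is a question for people who know the code better.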


Kind regards,
Bram.
