From: Bram Vandoren
Subject: [Bug-wget] escaped URLs and recursive retrieval
Date: Mon, 21 Nov 2011 16:20:29 +0100
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110920 SUSE/3.1.15 Thunderbird/3.1.15
Hi,

I encountered a bug in wget that occurs with recursive retrieval when a page contains two (or more) links such as:

<a href="http://example.com/~user/blah"> and <a href="http://example.com/%7Euser/blah">

Both links point to the same page, but the encoding is different. wget doesn't recognise them as the same page and downloads the page 'blah' twice. The second download also overwrites the first downloaded file. In addition, if you specify the conversion option '-k', it converts only one of the two links.
I had a quick look at the source code. It could be solved by changing url_parse in url.c to call url_unescape before parsing the URL. This way you get the same parsed URL for both links. I am not sure if this is a good way to solve it. The conversion should probably be similar to the conversion that's done to determine the file name of the URL.
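
To illustrate the idea outside of wget, here is a minimal Python sketch. It is not wget code, and normalize_escapes is a hypothetical helper name. It is also a more conservative variant than calling url_unescape on the whole URL: it decodes only escapes of "unreserved" characters (letters, digits, '-', '.', '_', '~'), on the assumption that decoding something like %2F could change the meaning of the URL.

```python
import re
import string

# Characters RFC 3986 calls "unreserved": escaping them never changes meaning.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(url: str) -> str:
    """Hypothetical helper: decode escapes of unreserved characters only.

    Other escapes (e.g. %2F) are kept, with their hex digits upper-cased
    so that %2f and %2F compare equal.
    """
    def repl(match):
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch
        return "%" + match.group(1).upper()

    return _ESCAPE.sub(repl, url)


# Use the normalized form as the key for "already seen" during recursion.
seen = set()
for link in ("http://example.com/~user/blah",
             "http://example.com/%7Euser/blah"):
    key = normalize_escapes(link)
    if key in seen:
        print("skip duplicate:", link)
    else:
        seen.add(key)
        print("fetch:", link)
```

With a key like this, both links above map to the same entry, so the page would be fetched once and both links could be handled the same way by '-k'.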
Kind regards, Bram.