Re: [Bug-wget] wget -crNl inf --- filenames mangled
From: Andres Valloud
Subject: Re: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Sun, 17 Feb 2019 15:02:22 -0800
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Thunderbird/60.5.1
Hi, so I ran wget like this:
wget --no-check-certificate -dcrNl inf $baseUrl/root/pub/mods/ -P $baseLocal -o wget-mods-2012.log
Looking at the log, '1f43' appears (I think) as a consequence of -l inf,
because .../mods/2012/ has a reference to .../mods/, which leads wget to
read the entire .../mods/ index.
According to my understanding of the log file, wget then collects all
the possible URLs from .../mods/. It is here that, after what seems
like thousands of files, a single merge log entry shows '1f43' (some path
parts elided).
.../root/pub/mods/index.html?C=N;O=D:
merge(‘.../root/pub/mods/?C=N;O=D’, ‘lizardking_-_quest.mp31f43’) ->
.../root/pub/mods/lizardking_-_quest.mp31f43
appending ‘.../root/pub/mods/lizardking_-_quest.mp31f43’ to urlpos.
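For context, wget's merge() resolves each discovered link against the page URL, much as standard RFC 3986 reference resolution does. A quick sketch with Python's urljoin (using the host and path from the debug log, and assuming https since the run used --no-check-certificate) shows the merge itself is a plain resolution, so the '1f43' suffix must already be present in the link text wget parsed out of the HTML:

```python
from urllib.parse import urljoin

# Base page and the suspect link, taken from the debug log above.
base = "https://saphirjd.me/root/pub/mods/?C=N;O=D"
link = "lizardking_-_quest.mp31f43"

# RFC 3986 resolution drops the query part and appends the relative name.
print(urljoin(base, link))
# https://saphirjd.me/root/pub/mods/lizardking_-_quest.mp31f43
```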
Then I issued the command (some path parts elided)
wget --no-check-certificate .../root/pub/mods/
which resulted in a 974 KB index.html file that has no occurrences of
'1f43' (more on this request down below).
I wondered whether this could be happening because there are .html files
that *do* contain '1f43' already present in the local download
directory. That is, will wget look at existing files, or will it
download them from scratch? The log file seems to indicate the
index.html was downloaded from scratch, not read from disk.
The "bad" request looks like this (some path parts elided):
---request begin---
GET /root/pub/mods/?C=N;O=D HTTP/1.1^M
Referer: .../root/pub/mods/^M
If-Modified-Since: Sun, 10 Feb 2019 02:33:09 GMT^M
Range: bytes=998575-^M
User-Agent: Wget/1.20.1 (linux-gnu)^M
Accept: */*^M
Accept-Encoding: identity^M
Host: saphirjd.me^M
Connection: Keep-Alive^M
^M
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK^M
Date: Sat, 16 Feb 2019 21:51:21 GMT^M
Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h^M
Keep-Alive: timeout=2, max=18^M
Connection: Keep-Alive^M
Transfer-Encoding: chunked^M
Content-Type: text/html;charset=UTF-8^M
^M
---response end---
200 OK
Length: unspecified [text/html]
Saving to: ‘.../root/pub/mods/index.html?C=N;O=D’
0K .......... .......... .......... .......... .......... 234K
50K .......... .......... .......... .......... .......... 11.6M
100K .......... .......... .......... .......... .......... 14.4M
150K .......... .......... .......... .......... .......... 238K
200K .......... .......... .......... .......... .......... 657K
250K .......... .......... .......... .......... .......... 11.3M
300K .......... .......... .......... .......... .......... 8.44M
350K .......... .......... .......... .......... .......... 397K
400K .......... .......... .......... .......... .......... 627K
450K .......... .......... .......... .......... .......... 2.38M
500K .......... .......... .......... .......... .......... 4.47M
550K .......... .......... .......... .......... .......... 3.46M
600K .......... .......... .......... .......... .......... 477K
650K .......... .......... .......... .......... .......... 4.14M
700K .......... .......... .......... .......... .......... 717K
750K .......... .......... .......... .......... .......... 3.50M
800K .......... .......... .......... .......... .......... 3.01M
850K .......... .......... .......... .......... .......... 4.40M
900K .......... .......... .......... .......... .......... 2.69M
950K .......... .......... ... 68.9K=1.4s
Last-modified header missing -- time-stamps turned off.
2019-02-16 13:51:25 (717 KB/s) - ‘.../root/pub/mods/index.html?C=N;O=D’
saved [998575]
Loaded .../root/pub/mods/index.html?C=N;O=D (size 998575).
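For what it's worth, the Range: bytes=998575- header above asks to resume at exactly the size already saved locally (wget -c behavior), and per RFC 7233 a first-byte-pos at or beyond the resource's current length is unsatisfiable. A minimal local sketch (toy server with a hypothetical handler, no real network involved) shows the 416 such a request provokes:

```python
import http.client
import http.server
import threading

CONTENT = b"x" * 100  # stand-in for the already-complete index.html

class RangeHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        rng = self.headers.get("Range", "")
        if rng.startswith("bytes=") and rng.endswith("-"):
            start = int(rng[len("bytes="):-1])
            if start >= len(CONTENT):
                # RFC 7233: first-byte-pos at or past the end is unsatisfiable
                self.send_response(416)
                self.end_headers()
                return
            body = CONTENT[start:]
            self.send_response(206)
        else:
            body = CONTENT
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), RangeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
# Mimic wget -c resuming at exactly the length already on disk.
conn.request("GET", "/", headers={"Range": "bytes=100-"})
status = conn.getresponse().status
server.shutdown()
print(status)  # 416
```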
The "good" request looks like this:
---request begin---
GET /root/pub/mods/ HTTP/1.1^M
User-Agent: Wget/1.20.1 (linux-gnu)^M
Accept: */*^M
Accept-Encoding: identity^M
Host: saphirjd.me^M
Connection: Keep-Alive^M
^M
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK^M
Date: Sun, 17 Feb 2019 22:42:04 GMT^M
Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h^M
Keep-Alive: timeout=2, max=25^M
Connection: Keep-Alive^M
Transfer-Encoding: chunked^M
Content-Type: text/html;charset=UTF-8^M
^M
---response end---
200 OK
Registered socket 5 for persistent reuse.
Length: unspecified [text/html]
Saving to: ‘index.html.1’
0K .......... .......... .......... .......... .......... 71.1K
50K .......... .......... .......... .......... .......... 221K
100K .......... .......... .......... .......... .......... 241K
150K .......... .......... .......... .......... .......... 232K
200K .......... .......... .......... .......... .......... 4.81M
250K .......... .......... .......... .......... .......... 1.64M
300K .......... .......... .......... .......... .......... 249K
350K .......... .......... .......... .......... .......... 2.49M
400K .......... .......... .......... .......... .......... 3.71M
450K .......... .......... .......... .......... .......... 258K
500K .......... .......... .......... .......... .......... 1.41M
550K .......... .......... .......... .......... .......... 1.46M
600K .......... .......... .......... .......... .......... 2.32M
650K .......... .......... .......... .......... .......... 340K
700K .......... .......... .......... .......... .......... 2.19M
750K .......... .......... .......... .......... .......... 4.10M
800K .......... .......... .......... .......... .......... 2.68M
850K .......... .......... .......... .......... .......... 3.17M
900K .......... .......... .......... .......... .......... 3.22M
950K .......... .......... ... 2.07M=2.1s
2019-02-17 14:42:09 (453 KB/s) - ‘index.html.1’ saved [997015]
So I examined the "bad" html file. Unlike the "good" file, the "bad"
file starts like this (contents enclosed by ====== bars):
======================================================================
13a
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>416 Requested Range Not Satisfiable</title>
</head><body>
<h1>Requested Range Not Satisfiable</h1>
<p>None of the range-specifier values in the Range
request-header field overlap the current extent
of the selected resource.</p>
</body></html>
0
HTTP/1.1 200 OK
Date: Sun, 10 Feb 2019 02:33:04 GMT
Server: Apache/2.4.23 (Win64) OpenSSL/1.0.2h
Keep-Alive: timeout=2, max=24
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html;charset=UTF-8
ee3
======================================================================
The "13a" and "ee3" strings are literally present in the file. This
data also seems to explain why the file saved to disk is about 1 KB
larger than the file downloaded individually. It looks like the
index.html file saved to disk begins with garbage from a different
request that ended in 416. After that prolog of apparent junk, the file
proper seems to begin as expected --- but it also has several
occurrences of '1f43'.
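Those stray tokens are plausibly hex chunk-size markers from HTTP/1.1 chunked transfer encoding (note Transfer-Encoding: chunked in both responses): 0x13a = 314, which matches the byte length of the 416 error body above assuming LF line endings; 0xee3 = 3811; and 0x1f43 = 8003. A minimal decoder sketch (simplified, not wget's actual code) shows what a client is supposed to strip; if decoding loses sync, these size lines leak into the saved file:

```python
def dechunk(raw: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body (simplified: no trailers,
    no chunk extensions)."""
    out, i = b"", 0
    while True:
        eol = raw.index(b"\r\n", i)
        size = int(raw[i:eol], 16)      # the hex size line, e.g. b"1f43"
        if size == 0:
            return out
        out += raw[eol + 2:eol + 2 + size]
        i = eol + 2 + size + 2          # skip chunk data plus trailing CRLF

# The markers seen in the saved file are all valid hex chunk sizes.
print(int("1f43", 16), int("ee3", 16), int("13a", 16))  # 8003 3811 314

print(dechunk(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"))  # b'Wikipedia'
```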
A vimdiff run on bad.html and good.html shows some ordering
differences, seemingly a table replaced with '1f43', and things of that
nature. The structure of the differences is not immediately obvious, as
there are very large sections that differ, seemingly because the
listing was served in a different order.
Andres.
On 2/17/19 12:15, Tim Rühsen wrote:
On 16.02.19 23:02, Andres Valloud wrote:
Tim,
I limited the data from 99 GB to 3.3 GB, restricted to the directory
where I've seen the problem occur. The strange string '1f43' appears in
this limited setup. The '1f43' substring seems to appear
deterministically depending on the file name (I have not checked
*every* occurrence by hand).
How should I track this down?
I'd use -d -olog and leave out -k. If 1f43 still appears, we know it's
not caused by wget's parsing or conversion; in that case it's coming
from the server. Check which file 1f43 appears in and find the
corresponding request in the log file.
Then try to download that file with a single (non-recursive) wget
command. Check if 1f43 appears in there. If it doesn't, compare both
requests to see the difference.
Let us know the results.
Regards, Tim
Andres.
On 2/14/19 04:03, Tim Rühsen wrote:
On 2/14/19 12:25 PM, Andres Valloud wrote:
Tim,
On 2/14/19 02:03, Tim Rühsen wrote:
I looked at the downloaded html files with grep. They do contain the
substring "1f43", seemingly after a ^M character (I did not check every
single occurrence). Sometimes, the ^M character is within a file name
such as this:
<tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
1f43^M
"
If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
correct. ^M is a carriage return (Microsoft uses ^M plus linefeed for
end-of-line, EOL). In an HTML file, EOL has no special meaning -
parsers simply ignore it. This is nothing that can be addressed with
--restrict-file-names.
But to make sure, look at the original file by downloading it with
'wget <URL>'. Does the file have the above '1f43'/^M stuff in it as
well? If so, we can't do much about it.
If all looks ok in there, please attach both files so we can compare
and possibly reproduce.
If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
request is coming via Firefox.
curl and wget both have a --user-agent option for this.
Do you get a different file when using that option ?
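As a quick illustration of the header being set (a sketch only; urllib stands in here for wget --user-agent=... or curl -A ..., and no request is actually sent):

```python
import urllib.request

# Example UA string from the message above; the server may vary
# its response based on it.
UA = "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0"

req = urllib.request.Request("https://saphirjd.me/root/pub/mods/",
                             headers={"User-Agent": UA})
# urllib.request.urlopen(req) would now send the spoofed agent, just as
#   wget --user-agent="$UA" <URL>   or   curl -A "$UA" <URL>   would.
print(req.get_header("User-agent"))
```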
There was one additional detail to make this work. Instead of placing a
request for index.html, I had to ask curl to get just the directory name
ending with a slash. Then the server responded with (essentially)
index.html.
A web server might serve different content for 'dir', 'dir/' and
'dir/index.html'. This is sometimes puzzling, and as you can see,
'dir/' can't be used as a filename - so we use 'dir/index.html' for it.
Which is not correct if the server serves 'dir/index.php' when we
request 'dir/'.
Both curl and wget retrieve index.html contents without '1f43' when
asking for just that URL. vimdiff says the retrieved files are
identical.
Try to start with this URL using your original wget command line. You
could add a quota (-Q) to limit the amount of data, in the hope of
reproducing your issue with far fewer files and far less data
downloaded.
I am at a loss to explain how the '1f43' problem appears when asking
wget to update the mirror of the site (rather than downloading a single
file). I'll look at the log file tomorrow and see if I get more ideas.
Try to reduce the needed amount of data to reproduce it.
Regards, Tim