wget2 | HTTP Response 0 flooding. (#609)


From: marcel dope (@marceldope)
Subject: wget2 | HTTP Response 0 flooding. (#609)
Date: Fri, 29 Jul 2022 17:16:58 +0000


marcel dope created an issue: https://gitlab.com/gnuwget/wget2/-/issues/609



Trivia:

wget1 has a bug where it truncates filenames even when they are within the
255-character limit imposed by the filesystem. Downloading a file with a
240-character name truncated the name to 236 characters, and downloading the
same file in mirror mode (recursive, creating directories that mirror the whole
path) truncated it to 207 characters. The latter happens because wget1
erroneously counts the path toward the filename length limit.
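
To illustrate the difference between a per-filename limit and a whole-path
limit, here is a minimal Python sketch. It is not wget1's actual code: the
255-byte value, the function names and the example path are assumptions made
for illustration, and it does not reproduce the exact 236/207 truncation
lengths reported above.

```python
NAME_MAX = 255  # assumed per-component limit in bytes (typical on Linux filesystems)

def components_fit(path, name_max=NAME_MAX):
    """Return True if every path component is within name_max bytes."""
    parts = [p for p in path.split("/") if p]
    return all(len(p.encode("utf-8")) <= name_max for p in parts)

def whole_path_fits(path, name_max=NAME_MAX):
    """Stricter check that counts the directories too, which is the
    behavior the report above attributes to wget1 in mirror mode."""
    return len(path.encode("utf-8")) <= name_max

# Hypothetical mirror path: 28 characters of directories plus a 240-character name.
mirror_path = "example.com/some/nested/dir/" + "a" * 240

print(components_fit(mirror_path))   # True: no single component exceeds 255 bytes
print(whole_path_fits(mirror_path))  # False: the full path is 268 bytes
```

Under these assumptions the file name needs no truncation at all, since only
individual components are limited by the filesystem.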
It's of utmost importance to me that the mirror I create matches the remote
copy in size, metadata and filenames.
I tried the wget2 command shown below; it doesn't seem to have this bug and
it's an order of magnitude better in every respect, so thank you for this.
BTW, I did extensive testing of wget2's behavior and found these differences
from the documentation / expected behavior:
- `-R "index.html*"` - this option is ignored, indexes are downloaded anyway
- from the manpage of `--force-progress`: `This option will also force the 
progress bar to be printed to stderr when used alongside the --output-file 
option.` - this doesn't work, no progress bar is shown
- `--progress=bar` - if specified, nothing is written to the `-o` output file 
or to a file that stdout is redirected to
- `--stats-all` - not implemented
 
 
 
**The problem:**

I use this command to mirror a website containing 10k small files: `wget2 -rNl 
inf -np --no-if-modified-since --retry-connrefused --waitretry=3600 
--retry-on-http-error=*,\!404 --https-enforce=hard -R "index.html*" 
--fsync-policy=on --random-wait --max-threads=1 -t inf --backups=99 -w 1 URL`. 
It works perfectly for a while and then I'm flooded with
```
[0] Checking 'URL' ...
HTTP response 0  [URL]
[0] Downloading 'URL' ...
HTTP response 0  [URL]
```
or
```
[0] Downloading 'URL' ...
HTTP response 0  [URL]
```
The `--stats-site` output is
```
  Status    ms   Size URL
       0     0      0 FILE
```
or
```
  Status    ms   Size URL
       0     1      0 FILE
```
for each try.
The tries seem to happen very fast (it could very well be 1000 tries per
second), judging by the output and by how quickly the output file grows.
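
One way to quantify the flood from a saved log is to count the status codes
in it. The following Python sketch assumes the log lines have exactly the
`HTTP response N  [URL]` form quoted above; the function name and the sample
text are illustrative only.

```python
import re
from collections import Counter

# Matches lines such as "HTTP response 0  [URL]" from the output quoted above.
RESPONSE_LINE = re.compile(r"^HTTP response (\d+)\s+\[(.*)\]\s*$")

def count_responses(log_text):
    """Count how often each HTTP status code appears in a wget2 log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESPONSE_LINE.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts

sample = """\
[0] Checking 'URL' ...
HTTP response 0  [URL]
[0] Downloading 'URL' ...
HTTP response 0  [URL]
"""

print(count_responses(sample))  # Counter({0: 2})
```

Dividing the count for status 0 by the time span covered by the log would give
a rough tries-per-second figure.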

Aborting the wget2 process and restarting it results in HTTP 200 responses
again (and eventually HTTP response 0 again). This suggests to me that wget2
may be flooding the server/Cloudflare/my OpenWrt router with hundreds of
requests each second, sabotaging the mirroring process, when it could simply
wait a second/minute/hour and get a 200 response. I wish the
`--waitretry=3600` and `-w 1` I specified applied here. Raising the wait time
between tries when I get a 0 response would likely fix this and let the mirror
succeed. Unfortunately, I currently have to re-parse and check (size,
timestamp) all 10k files again to continue mirroring, which greatly increases
the load on the server.

Questions:
1. Why do I get HTTP response 0 in the first place? What does it mean?
2. Is this a wget2 bug, an OpenWrt misfeature or a real Cloudflare/server response?

Note that I don't think I ever got an HTTP response 0 when mirroring the
same website with wget1.

Suggestions:
Make `--retry-connrefused --retry-on-http-error --random-wait` all work for
every request made by wget2, not just the downloads, and make the wait time for
all of these rise from the `-w` value to the `--waitretry` value. I suggest that
`--waitretry` shouldn't increase the wait by just 1 second per try, but should
double it each time, e.g. 1st try = 1s, 2nd try = 2s, 3rd try = 4s.
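
The doubling schedule suggested above can be sketched in a few lines of
Python. This is only an illustration of the proposal, not wget2's current
behavior; the function name and parameters are hypothetical, with `wait`
standing in for the `-w` value and `waitretry` for the `--waitretry` cap.

```python
def backoff_delays(wait, waitretry, tries):
    """Delay before each retry: start at `wait`, double every try,
    and never exceed `waitretry`."""
    delays = []
    delay = wait
    for _ in range(tries):
        delays.append(min(delay, waitretry))
        delay *= 2
    return delays

# With -w 1 and --waitretry=3600 as in the command above:
print(backoff_delays(1, 3600, 5))  # [1, 2, 4, 8, 16]
```

With these values the cap of 3600 seconds is reached on the 13th try, since
2^12 = 4096 exceeds it.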
wget2 could also postpone retries, optionally or by default, i.e. if you get a
timeout, start the retry wait for that file, try another file in the meantime,
and return to the previous one once x time has passed or the other downloads
have finished.
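
The postponed-retry idea can be sketched as a small simulated scheduler in
Python. Everything here is hypothetical (the `schedule` function, the `fetch`
callback and the tick-based clock are made up for illustration and say nothing
about how wget2 is implemented): a failed URL is parked until its retry time
while the remaining URLs are attempted.

```python
import heapq

def schedule(urls, fetch, retry_wait, max_tries=3):
    """Simulate deferring failed URLs. `fetch(url, tries)` returns True on
    success. Returns the order in which fetch attempts were made."""
    # Heap entries: (tick when eligible, sequence number, url, tries so far)
    queue = [(0, i, url, 0) for i, url in enumerate(urls)]
    heapq.heapify(queue)
    seq = len(urls)
    now = 0
    attempts = []
    while queue:
        ready_at, _, url, tries = heapq.heappop(queue)
        now = max(now, ready_at)  # idle only if nothing else is ready yet
        attempts.append(url)
        ok = fetch(url, tries)
        now += 1  # each attempt costs one simulated tick
        if not ok and tries + 1 < max_tries:
            # Park the URL; other entries with earlier ready times go first.
            heapq.heappush(queue, (now + retry_wait, seq, url, tries + 1))
            seq += 1
    return attempts

def flaky(url, tries):
    """Hypothetical fetcher: 'a' fails on its first try, everything else succeeds."""
    return not (url == "a" and tries == 0)

print(schedule(["a", "b", "c"], flaky, retry_wait=1))  # ['a', 'b', 'c', 'a']
```

In this sketch the failed URL does not block the queue; it is retried only
after the other entries have had their turn.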




