bug-wget

Re: Bug or at least strange behaviour


From: Tim Rühsen
Subject: Re: Bug or at least strange behaviour
Date: Fri, 17 Apr 2020 17:27:35 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.7.0

Hi,

-A / -R are applied before a file is downloaded, and in
http://www.cs.toronto.edu/maxsat-lib/maxsat-instances/master-set/index.html
all the subdirectories are linked as if they were files, not subdirectories
(a trailing / would indicate a subdirectory).
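
To illustrate the point, here is a toy model in Python of a suffix filter that runs before download. It is only a sketch based on the description above, not wget's actual code; the function name `rejected_before_download` and the rule that a trailing `/` bypasses the filter are assumptions made for illustration.

```python
from typing import Iterable
from urllib.parse import urlsplit


def rejected_before_download(url: str, accept_suffixes: Iterable[str]) -> bool:
    """Toy model: decide from the URL alone whether an accept list rejects it.

    Assumption: a link ending in '/' is treated as a directory and is not
    filtered; anything else is treated as a file name and must end with one
    of the accepted suffixes.
    """
    path = urlsplit(url).path
    if path.endswith("/"):
        # Looks like a directory, so the suffix filter does not apply.
        return False
    name = path.rsplit("/", 1)[-1]
    # Looks like a file: reject unless its name ends with an accepted suffix.
    return not any(name.endswith(suffix) for suffix in accept_suffixes)


base = "http://www.cs.toronto.edu/maxsat-lib/maxsat-instances/master-set/"
for link in ("unweighted", "unweighted/"):
    print(link, "rejected:", rejected_before_download(base + link, ["zip"]))
```

In this model a link written as `unweighted` is filtered out like any non-zip file, while `unweighted/` would be followed.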

Indeed, wget should send a HEAD request before applying -A / -R, and only
apply these filter options if the resulting MIME type is not text/html
or text/css. So this looks like a bug that should be fixed. My time is
currently very limited, so maybe someone can jump in and give it a try?
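
A minimal Python sketch of that idea, assuming the decision is split into a HEAD helper and a pure filter function. The names `head_content_type`, `should_download` and `ALWAYS_FETCH` are hypothetical, only the accept list (-A) is modelled, and none of this is taken from the wget or wget2 sources.

```python
import urllib.request
from typing import Iterable
from urllib.parse import urlsplit

# MIME types that are always fetched, since they may contain further links.
ALWAYS_FETCH = {"text/html", "text/css"}


def head_content_type(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request and return the MIME type without parameters."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type()


def should_download(url: str, content_type: str,
                    accept_suffixes: Iterable[str]) -> bool:
    """Apply the accept list only when the resource is not HTML or CSS."""
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in ALWAYS_FETCH:
        # Fetch regardless of the file name so that links can be extracted.
        return True
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    return any(name.endswith(suffix) for suffix in accept_suffixes)
```

A caller would combine the two, for example `should_download(url, head_content_type(url), ["zip"])`. Under this scheme a subdirectory linked without a trailing / is still followed, because its index page is served as text/html.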

You could check whether Homebrew provides wget2. Wget2 does it correctly and
would do what you expect.

Regards, Tim

On 16.04.20 20:07, Fahiem Bacchus wrote:
> Hi, I am creating a scientific archive containing problem sets and want to
> post wget instructions for downloading the problem sets.
> 
> 1. wget -r -nd -erobots=off
> http://www.cs.toronto.edu/maxsat-lib/maxsat-instances/master-set/unweighted
> -A 'zip'
>     Works: it descends into the subdirectories under unweighted and
> retrieves the zip files contained in each subdirectory.
> 2. wget -r -nd -erobots=off
> http://www.cs.toronto.edu/maxsat-lib/maxsat-instances/master-set/ -A 'zip'
>     Does not work: it stops after rejecting the index.html file in
> master-set.
> 3. wget -r -nd -erobots=off
> http://www.cs.toronto.edu/maxsat-lib/maxsat-instances/master-set/
>     Kind of works: it gets all of the files, but does not restrict itself
> to the zip files.
> 
> Maybe I don't understand the options? But it looks like a bug in the
> interaction of the -A flag and descending into
> subdirectories?
> 
>    thanks
>      Fahiem Bacchus
> 
> Here is the site
> http://www.cs.toronto.edu/maxsat-lib/
> 
> With directory structure:
>   master-instances
>        master-set
>             unweighted
>                 CircuitDebuggingProblems
>                         CircuitDebuggingProblems.zip
>                 .... many other subdirs each containing a zip
>             weighted
>                 many subdirs each containing a zip
>        ms-evals
>        original
> 
> I also tried a -l 10 flag...did not help.
> 
> Version info:
> ============
> GNU Wget 1.20.3 built on darwin18.6.0.
> 
> -cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls
> +ntlm +opie -psl +ssl/openssl
> 
> Wgetrc:
>     /usr/local/etc/wgetrc (system)
> Locale:
>     /usr/local/Cellar/wget/1.20.3_1/share/locale
> Compile:
>     clang -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/usr/local/etc/wgetrc"
>     -DLOCALEDIR="/usr/local/Cellar/wget/1.20.3_1/share/locale" -I.
>     -I../lib -I../lib -I/usr/local/opt/openssl@1.1/include -DNDEBUG -g
>     -O2
> Link:
>     clang -DNDEBUG -g -O2 -lidn2 -L/usr/local/opt/openssl@1.1/lib -lssl
>     -lcrypto -ldl -lz ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a
>     -liconv -lintl -Wl,-framework -Wl,CoreFoundation -lunistring
> 
> Copyright (C) 2015 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later
> <http://www.gnu.org/licenses/gpl.html>.
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> 
> Originally written by Hrvoje Niksic <address@hidden>.
> Please send bug reports and questions to <address@hidden>.
> =========
> /usr/local/etc/wgetrc
> --------------------------
> ###
> ### Sample Wget initialization file .wgetrc
> ###
> 
> ## You can use this file to change the default behaviour of wget or to
> ## avoid having to type many many command-line options. This file does
> ## not contain a comprehensive list of commands -- look at the manual
> ## to find out what you can put into this file. You can find this here:
> ##   $ info wget.info 'Startup File'
> ## Or online here:
> ##   https://www.gnu.org/software/wget/manual/wget.html#Startup-File
> ##
> ## Wget initialization file can reside in /usr/local/etc/wgetrc
> ## (global, for all users) or $HOME/.wgetrc (for a single user).
> ##
> ## To use the settings in this file, you will have to uncomment them,
> ## as well as change them, in most cases, as the values on the
> ## commented-out lines are the default values (e.g. "off").
> ##
> ## Commands are case-, underscore- and minus-insensitive.
> ## For example ftp_proxy, ftp-proxy and ftpproxy are the same.
> 
> 
> ##
> ## Global settings (useful for setting up in /usr/local/etc/wgetrc).
> ## Think well before you change them, since they may reduce wget's
> ## functionality, and make it behave contrary to the documentation:
> ##
> 
> # You can set retrieve quota for beginners by specifying a value
> # optionally followed by 'K' (kilobytes) or 'M' (megabytes).  The
> # default quota is unlimited.
> #quota = inf
> 
> # You can lower (or raise) the default number of retries when
> # downloading a file (default is 20).
> #tries = 20
> 
> # Lowering the maximum depth of the recursive retrieval is handy to
> # prevent newbies from going too "deep" when they unwittingly start
> # the recursive retrieval.  The default is 5.
> #reclevel = 5
> 
> # By default Wget uses "passive FTP" transfer where the client
> # initiates the data connection to the server rather than the other
> # way around.  That is required on systems behind NAT where the client
> # computer cannot be easily reached from the Internet.  However, some
> # firewall software explicitly supports active FTP and in fact has
> # problems supporting passive transfer.  If you are in such an
> # environment, use "passive_ftp = off" to revert to active FTP.
> #passive_ftp = off
> 
> # The "wait" command below makes Wget wait between every connection.
> # If, instead, you want Wget to wait only between retries of failed
> # downloads, set waitretry to maximum number of seconds to wait (Wget
> # will use "linear backoff", waiting 1 second after the first failure
> # on a file, 2 seconds after the second failure, etc. up to this max).
> #waitretry = 10
> 
> 
> ##
> ## Local settings (for a user to set in his $HOME/.wgetrc).  It is
> ## *highly* undesirable to put these settings in the global file, since
> ## they are potentially dangerous to "normal" users.
> ##
> ## Even when setting up your own ~/.wgetrc, you should know what you
> ## are doing before doing so.
> ##
> 
> # Set this to on to use timestamping by default:
> #timestamping = off
> 
> # It is a good idea to make Wget send your email address in a `From:'
> # header with your request (so that server administrators can contact
> # you in case of errors).  Wget does *not* send `From:' by default.
> #header = From: Your Name <username@site.domain>
> 
> # You can set up other headers, like Accept-Language.  Accept-Language
> # is *not* sent by default.
> #header = Accept-Language: en
> 
> # You can set the default proxies for Wget to use for http, https, and ftp.
> # They will override the value in the environment.
> #https_proxy = http://proxy.yoyodyne.com:18023/
> #http_proxy = http://proxy.yoyodyne.com:18023/
> #ftp_proxy = http://proxy.yoyodyne.com:18023/
> 
> # If you do not want to use proxy at all, set this to off.
> #use_proxy = on
> 
> # You can customize the retrieval outlook.  Valid options are default,
> # binary, mega and micro.
> #dot_style = default
> 
> # Setting this to off makes Wget not download /robots.txt.  Be sure to
> # know *exactly* what /robots.txt is and how it is used before changing
> # the default!
> #robots = on
> 
> # It can be useful to make Wget wait between connections.  Set this to
> # the number of seconds you want Wget to wait.
> #wait = 0
> 
> # You can force creating directory structure, even if a single file is
> # being retrieved, by setting this to on.
> #dirstruct = off
> 
> # You can turn on recursive retrieving by default (don't do this if
> # you are not sure you know what it means) by setting this to on.
> #recursive = off
> 
> # To always back up file X as X.orig before converting its links (due
> # to -k / --convert-links / convert_links = on having been specified),
> # set this variable to on:
> #backup_converted = off
> 
> # To have Wget follow FTP links from HTML files by default, set this
> # to on:
> #follow_ftp = off
> 
> # To try ipv6 addresses first:
> #prefer-family = IPv6
> 
> # Set default IRI support state
> #iri = off
> 
> # Force the default system encoding
> #localencoding = UTF-8
> 
> # Force the default remote server encoding
> #remoteencoding = UTF-8
> 
> # Turn on to prevent following non-HTTPS links when in recursive mode
> #httpsonly = off
> 
> # Tune HTTPS security (auto, SSLv2, SSLv3, TLSv1, PFS)
> #secureprotocol = auto
> 


