bug-wget

Exclusion failures


From: Roger Brooks
Subject: Exclusion failures
Date: Mon, 28 Jun 2021 19:36:47 +0200

I am trying to use wget 1.19.1 to back up a club website.  Here is a reduced
version of my wget command, which only accesses the public parts of the
website:
>>
cd /volume1/Backup/
wget -EkKrNpH \
     --output-file=wget.log \
     --domains=imcz.club,sf.wildapricot.org \
     --exclude-domains=webmail.imcz.club \
     --exclude-directories=calendar,Club-Events,External-Events,Sys,Fonts,fonts \
     --ignore-case \
     --level=2 \
     --no-parent \
     --no-proxy \
     --random-wait \
     --reject=ashx,"overlay*" \
     
--reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*"
\
     --rejected-log=wget-rejected.log \
     --restrict-file-names=windows \
     --wait=1 \
     https://imcz.club/
<<

Two of the exclusions in the command are failing:

1. --exclude-directories=Fonts,fonts
This is a workaround for wget’s creation of spurious font directories.  The
server has only one such directory, but the website’s backend platform (over
which I have no control) sometimes addresses it as “fonts” and sometimes as
“Fonts”.
I expected that the option "--ignore-case" in the absence of "--no-clobber"
would take care of this problem, but it did not.  Since the contents are
static, I don’t need to back them up regularly, so I tried excluding the
directory altogether.  Despite the exclusion, wget still insists on creating
the following directories:
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\fonts"
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230456-2021_Conflict"
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230459-2021_Conflict"
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230501-2021_Conflict"
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230504-2021_Conflict"
The resulting backup website does not find the fonts in the "_Conflict"
directories; they have to be copied into the "fonts" directory for the pages
in the mirrored site to display properly.
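
One possible explanation, not confirmed in this message: the wget manual's
own example for this option ("-X /cgi-bin") suggests that each list entry is
compared against the directory path starting at the site root, so a bare
"fonts" would only cover a top-level /fonts.  The sketch below imitates such a
root-anchored comparison with the shell's own pattern matching.  It is an
illustration only and not wget's code; the helper name dir_excluded and the
pattern BuiltTheme/*/*/[Ff]onts are assumptions made for this example.

```shell
#!/bin/sh
# Illustration only: a root-anchored pattern comparison done with the
# shell's "case" matching. This is NOT wget's implementation.

# dir_excluded PATTERN DIR
# Succeeds if DIR matches PATTERN exactly or lies below a matching directory.
dir_excluded() {
    case "$2" in
        $1|$1/*) return 0 ;;   # unquoted $1 is treated as a glob pattern
        *)       return 1 ;;
    esac
}

# Directory path relative to the site root, taken from the listing above.
dir='BuiltTheme/whiteboard_maya_blue.v3.0/10e4279e/fonts'

for pattern in 'fonts' 'BuiltTheme/*/*/[Ff]onts'; do
    if dir_excluded "$pattern" "$dir"; then
        echo "excluded by:  $pattern"
    else
        echo "not excluded: $pattern"
    fi
done
```

If this assumption holds, an entry such as
--exclude-directories='/BuiltTheme/*/*/[Ff]onts' (untested here) might be
worth trying, since the manual states that list elements may contain
wildcards.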

2. --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*"
This is an attempt to prevent duplicate downloading of files. The following
file is downloaded, even though https://regex101.com says that it matches my
regex:
"W:\imcz.club\event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
It is effectively a duplicate of:
"W:\imcz.club\event-4193082.html"
Increasing "--level" produces additional examples.
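
A possible cause, offered as a hedged guess: according to the wget manual,
--regex-type defaults to "posix", while regex101.com tests PCRE by default,
and "\d" is not a portable digit class in POSIX extended regular expressions.
The sketch below uses "grep -E" as a stand-in for a POSIX ERE matcher to
compare the original "event-" alternative with a variant that uses "[0-9]".
The URL is an assumption reconstructed from the saved file name (with
--restrict-file-names=windows, "?" is written as "@"), and the helper name
matches_ere is made up for this example.  The result for the "\d" pattern
depends on the regex library in use, so no output is shown.

```shell
#!/bin/sh
# Assumed original URL, reconstructed from the saved file name.
url='https://imcz.club/event-4193082?CalendarViewType=1&SelectedDate=6%2F27%2F2021'

# matches_ere PATTERN STRING
# Succeeds if STRING contains a match for PATTERN, read as a POSIX ERE.
# stderr is silenced because some grep versions warn about a stray backslash.
matches_ere() {
    printf '%s\n' "$2" | grep -Eq -- "$1" 2>/dev/null
}

# First pattern: the alternative as written in the command.
# Second pattern: the same alternative with a portable digit class.
for pattern in 'event-\d+[@\?].*' 'event-[0-9]+[@\?].*'; do
    if matches_ere "$pattern" "$url"; then
        echo "would reject: $pattern"
    else
        echo "would keep:   $pattern"
    fi
done
```

If the "[0-9]" variant behaves differently from the "\d" one on the
DiskStation, replacing "\d" in --reject-regex, or adding --regex-type=pcre
where wget was built with PCRE support, may be worth a try.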

I am aware that 1.19.1 is not the latest version, but wget is running on a
Synology DiskStation, which makes it difficult to update.
I haven't found any indication that these problems are known bugs which have
since been fixed.
Any advice is welcome!


