bug-wget

Re: Exclusion failures


From: Tim Rühsen
Subject: Re: Exclusion failures
Date: Thu, 8 Jul 2021 19:54:08 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

I think I don't understand your fonts/ problem correctly, sorry.

The regex issue seems to be that wget is using POSIX regex by default.
Please try to use --regex-type=pcre for PCRE regex.
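
As a rough sketch of the two options (not tested against this site): either keep the pattern and switch the engine with --regex-type=pcre, which only works if wget was built with libpcre, or spell the digit class the POSIX way as [0-9]. The snippet below uses grep -E as a stand-in for a POSIX extended regex engine to check the rewritten pattern against the file name from the report; the variable names are illustrative only.

```shell
# POSIX ERE has no \d shorthand; [0-9] is the portable spelling.
pattern='calendar[@?].*|Club-Events[@?].*|External-Events[@?].*|event-[0-9]+[@?].*'
sample='event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html'

# grep -E is only a stand-in for wget's POSIX matcher here.
if printf '%s\n' "$sample" | grep -Eq "$pattern"; then
    result=match
else
    result=no-match
fi
echo "$result"

# The corresponding wget options would then be one of (assumed, untested):
#   --reject-regex="$pattern"
#   --regex-type=pcre --reject-regex='...|event-\d+[@\?].*'
```

Note that wget applies --reject-regex to the URL rather than to the saved file name, so on the wire the separator is "?"; the "@" only appears in local names because of --restrict-file-names=windows. The character class [@?] covers both.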

You can get the latest version of wget built for Windows (incl. PCRE support) at https://eternallybored.org/misc/wget/.

Regards, Tim

On 08.07.21 16:26, Roger Brooks wrote:
Thanks for the explanations. Unfortunately, I don't find them convincing:


So the fonts/ directory is not automatically deleted by wget when it is
empty. It was used for temporary files during the download.
<<
Actually, the "fonts" directory is *not* empty, nor are the
"Fonts_*_Conflict" directories.


Why should '@CalendarView' match 'calendar[@/?]' ?
<<
The component of the regex which should match is not "calendar[@\?].*" (the
first term in the regex). It is "event-\d+[@\?].*" (the fourth and last term
in the regex).
Once again, https://regex101.com/ confirms that
"event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html" matches
this term.

Thanks for your support.

-----Original Message-----
From: Tim Rühsen <tim.ruehsen@gmx.de>
Sent: Monday, July 5, 2021 4:09 PM
To: Roger Brooks <r.s.brooks@ieee.org>; bug-wget@gnu.org
Subject: Re: Exclusion failures

On 28.06.21 19:36, Roger Brooks wrote:
I am trying to use wget 1.19.1 to back up a club website.  Here is a
reduced version of my wget command, which only accesses the public
parts of the website:

cd /volume1/Backup/
wget -EkKrNpH \
       --output-file=wget.log \
       --domains=imcz.club,sf.wildapricot.org \
       --exclude-domains=webmail.imcz.club \
       --exclude-directories=calendar,Club-Events,External-Events,Sys,Fonts,fonts \
       --ignore-case \
       --level=2 \
       --no-parent \
       --no-proxy \
       --random-wait \
       --reject=ashx,"overlay*" \
       --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*" \
       --rejected-log=wget-rejected.log \
       --restrict-file-names=windows \
       --wait=1 \
       https://imcz.club/
<<

Two of the exclusions in the command are failing:

1. --exclude-directories=Fonts,fonts
This is a workaround for wget’s creation of spurious font directories.
The server has only one such directory, but the website’s backend
platform (over which I have no control) sometimes addresses it as
“fonts” and sometimes as “Fonts”.
I expected that the option "--ignore-case" in the absence of
"--no-clobber" would take care of this problem, but since the contents
are static, I don’t need to back it up regularly.  Despite the
exclusion, wget still insists on creating the following directories:
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\fonts"
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230456-2021_Conflict"
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230459-2021_Conflict"
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230501-2021_Conflict"
"W:\imcz.club\BuiltTheme\whiteboard_maya_blue.v3.0\10e4279e\Fonts_ADMIN_Jun-27-230504-2021_Conflict"
The resulting backup website does not find the fonts in the "_Conflict"
directories; they have to be copied into the "fonts" directory for the
pages in the mirrored site to display properly.
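
The manual copy step described above could be scripted roughly as follows. This is only a sketch for a POSIX shell on the machine holding the mirror: the function name merge_conflict_fonts and its theme_dir argument are hypothetical, and files already present in fonts/ are overwritten by same-named files from the _Conflict directories.

```shell
# Hypothetical helper: merge every Fonts_*_Conflict directory into fonts/.
# theme_dir is an assumed argument: the directory that contains both
# "fonts" and the "Fonts_*_Conflict" directories in the mirror.
merge_conflict_fonts() {
    theme_dir=$1
    mkdir -p "$theme_dir/fonts"
    for dir in "$theme_dir"/Fonts_*_Conflict; do
        [ -d "$dir" ] || continue      # skip if the glob matched nothing
        # "dir/." copies the directory contents, including dot files.
        cp -R "$dir"/. "$theme_dir/fonts/"
    done
}
```

It would be called with the theme directory of the mirror as its only argument, for example the 10e4279e directory listed above.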

So the fonts/ directory is not automatically deleted by wget when it is
empty. It was used for temporary files during the download.
This is a known "issue", but since an empty directory doesn't take up much
disk space, it hasn't been fixed yet (maybe nobody thought it was relevant).
Wget2 doesn't have this issue.

I don't know where the *_Conflict/ directories are from. Seems like a server
thing.


2. --reject-regex="calendar[@\?].*|Club-Events[@\?].*|External-Events[@\?].*|event-\d+[@\?].*"
This is an attempt to prevent duplicate downloading of files. The
following file is downloaded, even though https://regex101.com says
that it matches my regex:
"W:\imcz.club\event-4193082@CalendarViewType=1&SelectedDate=6%2F27%2F2021.html"
It is effectively a duplicate of:
"W:\imcz.club\event-4193082.html"
Increasing "--level" produces additional examples.

Why should '@CalendarView' match 'calendar[@/?]' ?
Maybe your regex should be '[@\?]calendar.*' !?

Regards, Tim

