Hi Frans,
my apologies, maybe I stopped the download too fast.
The command line with the artworks regex indeed has no effect.
In fact, after looking into the code, I can confirm that I hardly see
any of the filtering applied to FTP URLs that we apply to HTTP.
I am currently not sure whether that is a regression or whether it
never worked. Maybe that was intended / planned by the original authors.
Sorry, this also puzzles me a bit... I have to test with older versions
when time allows.
Regards, Tim
On 26.12.20 15:12, Frans de Boer wrote:
On 25-12-2020 18:42, Tim Rühsen wrote:
Hello Frans,
I tried with wget 1.20.3, and both of these commands work:
#1 Do not download smc/artworks/ directory:
wget -d -4 --mirror -nH -np --retr-symlinks=no --passive-ftp \
    --no-verbose --cut-dirs=1 ftp://mirror.netcologne.de/savannah/smc/ \
    --reject-regex=".*(/artworks/.*)"
#2 Do not download .bz2 and .rpm files
wget -d -4 --mirror -nH -np --retr-symlinks=no --passive-ftp \
    --no-verbose --cut-dirs=1 ftp://mirror.netcologne.de/savannah/smc/ \
    --reject-regex=".*(\.bz2|\.rpm)$"
(--regex-type=posix is default)
(the order of URL and options doesn't matter)
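A pattern can also be checked offline before starting a mirror run. The following is only a minimal sketch, assuming that `grep -E` (POSIX extended syntax) is a close enough stand-in for wget's posix regex type. The helper name `is_rejected` and the sample file names under the mirror URL are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if URL $2 matches reject pattern $1.
# Assumption: grep -E approximates wget's --regex-type=posix matching.
is_rejected() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

base='ftp://mirror.netcologne.de/savannah/smc'

# Sample URLs; the file names are invented for this check.
for url in "$base/artworks/logo.png" "$base/smc-1.0.tar.bz2" "$base/README"; do
    if is_rejected '.*(/artworks/.*)' "$url"; then
        echo "rejected by #1: $url"
    elif is_rejected '.*(\.bz2|\.rpm)$' "$url"; then
        echo "rejected by #2: $url"
    else
        echo "kept: $url"
    fi
done
```

If a pattern matches the sample URL here but wget still downloads the file, the pattern itself is probably not the problem.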
Regards, Tim
On 23.12.20 13:48, Frans de Boer wrote:
LS,
I found that wget 1.20 and later do support some basic regular
expressions. I had good results with --accept-regex, but the
reject part is more troublesome. I can't use EREs, since only
BREs are supported, with the restriction that the whole URL must be
included.
I use wget to mirror some sites, but I do not want certain sub
directories included in the download. You can think of sub
directories named rpm, debug, temp etc.
Example:
wget -4 --mirror -nH -np --retr-symlinks=no --passive-ftp \
    --no-verbose --cut-dirs=1 --regex-type posix --reject-regex \
    "ftp\:\/\/mirror\.netcologne\.de\/savannah\/smc\/Screensaver\/" -P \
    ./debugdir/nongnu ftp://mirror.netcologne.de/savannah/smc/
I tried this example with and without some of the backslashes, but no
variant works. I also tried this with a single file, again to no avail.
I understand that one can add multiple reject statements, but I
would rather use the ERE .*(dir1|dir2|dir3|...|dirx|(..ERE..)),
although that is rather cumbersome when I have to specify them by hand.
I already have an ERE string ready and would like to use that
instead. Breaking down this string again into multiple reject
statements might also not work if I can't even reject one file or
sub directory.
Is there a way to accomplish above without having to resort to
loops and sed as the filtering tool?
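One way to avoid typing the alternation by hand, without loops or sed, is to let the shell join a list of directory names with `|`. This is only a sketch: the function name `join_alternation` is hypothetical, the directory names are the examples mentioned above, and it assumes that none of the names contain regex metacharacters.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with '|' into '.*/(a|b|c)/.*'.
# Assumption: the names contain no regex metacharacters that need escaping.
join_alternation() {
    # Run in a subshell so the changed IFS does not leak into the caller.
    (IFS='|'; printf '.*/(%s)/.*\n' "$*")
}

pattern=$(join_alternation rpm debug temp)
echo "$pattern"   # prints .*/(rpm|debug|temp)/.*
```

The result could then be passed as --reject-regex="$pattern", provided the filter is actually applied to the URLs in question.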
Regards, Frans
Hello Tim,
Alas, using wget version 1.20.3 under openSUSE 15.2, the line
excluding the artworks directory does not work. The whole artworks
sub directory is downloaded. To be sure, I also copied your line exactly
to see if that makes a difference. By the way, I also tried this
under openSUSE Tumbleweed. The -d option does not indicate anything
about the regex used.
The strange thing is that when I use a similar approach for python,
I am able to use the following arguments to the reject statement:
".*/(amd64|binaries|Debug|debug|deleted|OLD|old|Patches|patches|prev|previous|rpm|RPM|rpms|RPMS|temp|tmp|w32|win32|.*(rc|RC|a|b|p)[[:digit:]]{1}.*)/.*"
- my universal string for all other projects too.
With this I have to add that I also use an --accept-regex for python
and no such addition for nongnu.
So, I wonder why it seems to work on your side and not on my side.
--- Frans