bug-wget

Re: [Bug-wget] Check external reference, but don't process further


From: Darshit Shah
Subject: Re: [Bug-wget] Check external reference, but don't process further
Date: Tue, 27 Nov 2018 13:30:25 +0100
User-agent: NeoMutt/20180716

Hi Fernando,

As far as I'm aware there is no way to limit the recursion depth only on
foreign hosts. Something like this would definitely be a lot easier to do using
Wget2, which offers a few more powerful tools than Wget does. Wget2's alpha is
currently available in the Debian repositories and Arch Linux's AUR.

If you'd still like to continue using Wget, one way to pull this off would be
to have Wget print its debug output and then parse that to extract all the URIs
on foreign hosts. You can then have a second invocation of Wget to test for
their existence. An example of doing this would be:

$ wget -r --spider -d example.com 2>&1 \
    | grep -B1 "This is not the same hostname as the parent's" \
    | grep "Deciding whether to enqueue" \
    | sed 's/.*"\(.*\)"\./\1/g' \
    | wget --spider -i-

Of course, you may want to modify this to meet your own needs, but the general
idea should work for you.
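
If you end up reusing this, the filtering part of the pipeline could be wrapped
in a small shell function, roughly as sketched below. This is only a sketch: the
function name extract_foreign_urls and the file name foreign.txt are made up for
illustration, it assumes the debug messages keep the wording matched above, and
it relies on a grep that supports -B (GNU and BSD grep do). Wget writes its
debug log to stderr, hence the 2>&1 in the usage comment.

```shell
#!/bin/sh
# Hypothetical helper: reads Wget debug output on stdin and prints every
# URL that was rejected for being on a foreign host, once each.
extract_foreign_urls() {
    # Keep each "not the same hostname" message plus the line before it,
    # then keep only the preceding "Deciding whether to enqueue" lines.
    grep -B1 "This is not the same hostname as the parent's" \
        | grep "Deciding whether to enqueue" \
        | sed 's/.*"\(.*\)"\./\1/' \
        | sort -u    # drop duplicates so each URL is checked only once
}

# Usage (example.com and foreign.txt are placeholders):
#   wget -r --spider -d example.com 2>&1 | extract_foreign_urls > foreign.txt
#   wget --spider -i foreign.txt
```

Writing the list to a file first also lets you inspect which external URLs
will be checked before the second Wget run.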

* Fernando Gont <address@hidden> [181127 13:08]:
> Folks,
> 
> I'm using wget in a script to check for broken links in a web site,
> which uses the "--spider" mode.
> 
> I'd like wget to operate in recursive mode for pages in the target
> domain, but not for pages in other hosts/sites.
> 
> That is, if I'm crawling www.example.com, I'd like wget to process all
> pages in that domain recursively. However, if there's a link to an
> external site, I just want wget to check that URL, but not process that
> external reference recursively.
> 
> "-D" would seem to prevent checking external references, so I cannot use
> it. And "--level" would mean that pages on external sites may still be
> processed recursively.
> 
> Any advice on how to implement this?
> 
> Thanks!
> 
> Cheers,
> Fernando
> 
> 
> 
> 
> -- 
> Fernando Gont
> SI6 Networks
> e-mail: address@hidden
> PGP Fingerprint: 6666 31C6 D484 63B2 8FB1 E3C4 AE25 0D55 1D4E 7492
> 
> 
> 
> 
> 
> 

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
