
Re: [Bug-wget] download page-requisites with spanning hosts


From: Petr Pisar
Subject: Re: [Bug-wget] download page-requisites with spanning hosts
Date: Thu, 30 Apr 2009 11:19:24 +0200
User-agent: Mutt/1.5.16 (2007-06-09)

On Thu, Apr 30, 2009 at 03:31:21AM -0500, Jake b wrote:
> On Thu, Apr 30, 2009 at 3:14 AM, Petr Pisar <address@hidden> wrote:
> >
> > On Wed, Apr 29, 2009 at 06:50:11PM -0500, Jake b wrote:
> but I'm not sure how to tell wget what the output html file should be named.
> 
wget -O OUTPUT_FILE_NAME
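
For example (the output file name is arbitrary here, pick whatever you like):

wget -O page912.html 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330'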

> > > How do I make wget download all images on the page? I don't want to
> > > recurse other hosts, or even sijun, just download this page, and all
> > > images needed to display it.
> > >
> > That's not an easy task, especially because all the big desktop images are
> > stored on other servers. I think wget is not powerful enough to do it all
> > on its own.
> 
> Are you saying that's because some services show a thumbnail that you have
> to click through to get the full image?
[…]
> Would it be simpler to say something like: download page 912, recursion
> level=1 (or 2?), except for non-image links? (So it only allows recursion
> on images, i.e. downloading "randomguyshost.com/3.png".)
> 
You can limit downloads according to file name extensions (option -A); however,
this will also remove the sole main HTML file and hinder the recursion. And
no, there is no option to download only files referenced from a specific HTML
element like IMG.
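
For illustration, a minimal sketch of that behaviour (in recursive mode wget
still fetches HTML pages to parse them for links, but then deletes them because
they do not match the accept list, so the thread page itself will not survive):

wget -r -l 1 -H -A 'jpg,jpeg,png' \
  'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330'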

Without the -A option, you get a lot of useless files (regardless of spanning).

If you look at the locations of the files you are interested in, you will see
that they are all hosted outside the Sijun domain, and every page references
only a small number of them. Thus it's more efficient, and friendlier to the
servers, to extract these URLs first and then download only them.
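
For example, as a two-step variant of the pipeline quoted below (urls.txt is an
arbitrary name; you can inspect or trim the list before fetching anything):

wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330' \
  | grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' > urls.txt
wget -i urls.txt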

> But the problem is that it does not span any hosts? Is there a way I can
> achieve this if I do the same, except allow spanning to every host, recurse
> level=1, and only recurse non-image links?
>
There is the option -H for spanning hosts. The following wget-only command does
what you want, but as I said, it produces a lot of useless requests and files.

wget -p -l 1 -H \
  'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330'


> > I propose using other tools to extract the image URLs and then to download
> > them using wget. E.g.:
> 
> I guess I could use wget to get the html and parse that for image tags
> manually, but then I don't get the forum thread comments, which aren't
> required but would be nice.

You can do both: extract image URLs and extract comments.
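
E.g., save the page once and work from the local copy (thread.html is just an
example name):

wget -O thread.html 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330'
grep -o -E 'http://[^"]*\.(jpg|jpeg|png)' thread.html | wget -i -

That way you keep the HTML with the comments and still hand the image URLs to
wget.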

> 
> > wget -O - 'http://forums.sijun.com/viewtopic.php?t=29807&postdays=0&postorder=asc&start=27+330' \
> >   | grep -o -E 'http:\/\/[^"]*\.(jpg|jpeg|png)' | wget -i -
> >
> Ok, will have to try it out. (In Windows ATM, so I can't pipe.)
> 
AFAIK the Windows shells command.com and cmd.exe support pipes.

> Using python, and I have dual boot if needed.
> 
Or you can execute the programs from Python, connected through pipes (e.g. with
the subprocess module).

-- Petr


