
[Bug-wget] Dealing with the frameset tag (paid? patch required)


From: Henry C.
Subject: [Bug-wget] Dealing with the frameset tag (paid? patch required)
Date: Sat, 23 Jan 2010 12:43:03 +0200
User-agent: SquirrelMail/1.5.2 [SVN]

Greets!

Does anyone have any ideas on how to deal with a website that uses a
frameset whose frames point to another host?

This is basically what I want to achieve:

- if a page of a website uses a frameset pointing to a different host
(which is ignored under normal operation), I want wget to grab a single
page from that address (storing it locally as if it were part of the
current site) and then continue normal operation (which will probably
mean exiting).
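
As a rough illustration of the behaviour described above (not a wget patch, and written in Python rather than wget's C), the detection step could look like the sketch below. The names `FrameCollector` and `foreign_frame_urls` are hypothetical, and the sketch assumes the crawler already has the page's URL and HTML in hand.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class FrameCollector(HTMLParser):
    """Collect the src attribute of every <frame> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "FRAME" also matches here.
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def foreign_frame_urls(page_url, html):
    """Return absolute frame URLs whose host differs from page_url's host."""
    collector = FrameCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    urls = []
    for src in collector.sources:
        absolute = urljoin(page_url, src)  # resolve relative frame sources
        if urlsplit(absolute).hostname != page_host:
            urls.append(absolute)
    return urls
```

A crawler built on this would fetch each returned URL exactly once, without following its links, and store the result under the parent site's directory, so host spanning stays disabled everywhere else.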

The problem that I have is that if a site uses a frameset for *all* its
content, then basically nothing gets downloaded.

I don't want to use --span-hosts since it might affect other crawl sessions.

I use a patched* wget (v1.10.2) as the guts of a web crawler.  Since
everyone uses Google, its search results are the baseline.  I've found
that this is basically what Google is doing:  if the site has a frameset,
then grab the content it points to and store that for the parent site.

I imagine this can only be achieved with a patch to the source.  Since I
don't have the time to dig back into wget's source to do this, I'm
prepared to personally pay for this change.

If anyone feels up to it (and has experience patching wget to add
functionality), drop me an email.

Thanks
Henry


---
* my patches:
--content-type=LIST     comma-separated list of accepted content-types.
--content-type-exclude=LIST  comma-separated list of rejected content-types.
--max-url-len=NUMBER    accept maximum NUMBER URL length.
--max-files=NUMBER      maximum number of files to download.





