Re: lynx-dev The traversal limitation
From: per . magnus . banck
Subject: Re: lynx-dev The traversal limitation
Date: Fri, 23 Oct 1998 7:15:00 +0100
David Woolley <address@hidden> wrote in reply to me:
>>>>> Most sites do contain much more material than I ever want to download
>>>>> over my slow link
>> Many sites object strongly to being crawled as well, because they expend
>> bandwidth on pages not read. IMDB is a case in point. Please never crawl
>> that site with Lynx or you will find that Lynx gets permanently barred from
>> it. (The other issue is that mirrored copies breach the copyright and
>> deny them the ability to obtain the advertising revenue that pays for the
>> site.)
We are in full agreement over what policy issues are involved in this.
What I tried to do was to _limit_ the number of pages being crawled.
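A minimal sketch of that kind of limited traversal, assuming lynx's documented -crawl, -traversal and -realm options and its reject.dat mechanism (the URLs and host name here are invented examples, not from this thread):

```shell
# Sketch of a limited traversal run; URLs and paths are examples only.
START_URL="http://www.example.com/index.html"

# In traversal mode lynx consults reject.dat in the current directory
# for URL prefixes it should never follow (here: anything under /cgi-bin/).
printf '%s\n' "http://www.example.com/cgi-bin/" > reject.dat

# -traversal follows links from the start file, -crawl saves each page
# to an lnk*.dat file, and -realm keeps the walk inside the starting realm.
# The command is echoed rather than executed so the sketch does not
# require lynx to be installed.
echo lynx -crawl -traversal -realm "$START_URL"
```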
>>>>> So I try to filter the searches from the start file via the reject.dat
>>>>> file and -realm. But in this case, the interesting pages are in a /cgi-bin/
>> Pages aren't normally in /cgi-bin, but rather a program is run to create
>> the page on the fly when you reference URLs of this form. That's
>> particularly
>> expensive for the site and most sites will use the robots.txt file to bar
>> access to well behaved crawlers. Unfortunately, Lynx is NOT well behaved.
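For reference, barring /cgi-bin/ from well-behaved crawlers takes only two lines in a site's robots.txt (a generic example, not taken from any site mentioned here):

```
User-agent: *
Disallow: /cgi-bin/
```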
>>>>> "The '-traversal' switch is for http URLs and cannot be used for file:"
>>>>> Is there any big security concern behind this limitation?
>>>>> If not, I suggest we skip this test altogether.
Getting Lynx to be well-behaved is surely worth a discussion of its own,
but that is far outside my limited knowledge of Lynx internals.
The point I tried to raise was merely whether any security concern exists
behind the limitation that the start file is not allowed to be on the local
disk (file://localdisk/...).
Does anybody know?
>> wget is designed for this purpose and does exist in win32 versions. It is
>> well behaved as a crawler, but can be given an explicit list of URLs, on
>> the command line, or in a file, and will then bypass robots.txt. Because it
>> is well behaved, it is less likely to be barred, although excessive use
>> for mirroring a set of pages could still have this effect.
>> It doesn't render the HTML, but can fixup internal URLs so that they work
>> from the local filesystem, allowing you to use Lynx, or another browser, on
>> that copy.
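The wget workflow described above might look like this (the file name and URLs are invented; -i, -k/--convert-links and -w are documented wget options):

```shell
# List the exact URLs to fetch; wget fetches URLs given explicitly
# rather than discovering them by crawling (names here are examples).
cat > url-list.txt <<'EOF'
http://www.example.com/page1.html
http://www.example.com/page2.html
EOF

# -i reads URLs from the file, -k rewrites links in the saved pages so
# they work from the local filesystem, and -w 2 waits two seconds
# between requests to stay polite. The command is echoed so the sketch
# runs without network access or wget installed.
echo wget -i url-list.txt -k -w 2
```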
I will try out wget asap to see if it suits my needs - it took a while to
find the Windows port. But in case anyone else wants to try, it can be
found here:
http://www.interlog.com/~tcharron/wgetwin.html
/Magnus
=======================================================
Per Magnus Banck address@hidden
Electoral Information Service, Box 4186, SE-10264 Stockholm (Sweden)
=======================================================