
lynx-dev Crawling and User Agents


From: David Woolley
Subject: lynx-dev Crawling and User Agents
Date: Wed, 15 Apr 1998 23:07:56 +0100 (BST)

A letter from the operator of www.imdb.com to my ISP's printed magazine
highlights a problem with user-agent strings and traversal mode.  They are
blacklisting some web accelerator products because these generate hits on
the site for pages that are never subsequently used (they want a high hit
count for their advertisers, but presumably want the hits to come from
different sources).

Because Lynx's crawling (traversal) mode does not honour robots.txt, using
it could well result in the Lynx user-agent string being blacklisted.  (The
site is very heavily restricted by robots.txt.)  I have no reason to believe
Lynx is a blacklist candidate at the moment, though.
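
To illustrate what honouring robots.txt would involve, here is a minimal
sketch in Python (not Lynx code) using the standard library's
urllib.robotparser.  The robots.txt content, the host name and the
user-agent string are all made up for the example; they are not taken from
any real site or from Lynx itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /cgi-bin/
"""


def make_checker(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


checker = make_checker(SAMPLE_ROBOTS_TXT)
```

A crawler would call may_crawl() before following each link and skip any
URL for which it returns False.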

If it is not already the case, I would suggest that Lynx should include
a modifier in the user-agent string to indicate that it was crawling and
that there should be NO user interface options to disable that modifier.
I would suggest the modifier include "crawl" rather than "travers*", as
crawl is more likely to be understood by site operators.
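
As a rough sketch of the idea (in Python rather than Lynx's own source, and
with a made-up base string and token placement), the modifier could be
appended unconditionally whenever traversal mode is active.  The names
build_user_agent and CRAWL_MODIFIER are hypothetical.

```python
# Hypothetical token; the exact spelling and position are assumptions.
CRAWL_MODIFIER = "crawl"


def build_user_agent(base: str, crawling: bool) -> str:
    """Return the user-agent string, marked when crawling.

    There is deliberately no parameter for suppressing the modifier:
    if crawling is True, the token is always present in the result.
    """
    if crawling and CRAWL_MODIFIER not in base.split():
        return f"{base} {CRAWL_MODIFIER}"
    return base
```

For example, build_user_agent("Lynx/2.8", True) gives "Lynx/2.8 crawl",
while interactive use leaves the string unchanged.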

Note that my view is that Lynx users should respect a site's wish not
to be crawled, rather than continually modify user-agent strings to
bypass blacklisting.  (As the site is presumably in the UK, I think
that attempts to defeat the policy could be considered an offence under
the UK anti-hacking laws - i.e. unauthorised access - even though a
prosecution would probably bring unwelcome publicity.)

(www.imdb.com is the Internet Movie Database site, which apparently
has a business model based on selling advertising space.)
