Re: lynx-dev Non-interactive lynx

From: Duncan Simpson
Subject: Re: lynx-dev Non-interactive lynx
Date: Sun, 18 Mar 2001 04:16:01 +0000

> In "lynx-dev Non-interactive lynx"
> I think it would still provoke those who spend time and consideration
> on which of their files have;
>       <META NAME="robots" CONTENT="all/none/nofollow/noindex">
> and so forth.  Also bear in mind that no robot can read copyright
> notices in the body of a page.
Does wget notice this? The robot exclusion protocol I know about is different: 
/robots.txt contains a set of glob patterns covering the pages that robots 
should not read. I am pretty sure this is what wget knows about. 
If a site features the META tags you suggest then it almost certainly 
provides robots.txt as well.
> Just wondered: how easy/hard would it be to make Lynx obey robot
> exclusion protocols in non-interactive mode?  This is also done
> with HTTP headers?
Obeying robots.txt should be eminently possible given a C library with a 
working version of fnmatch in it. It might make sense to provide an 
implementation of fnmatch for those with a deficient C library, for example the 
C library M$ visual C++ provides probably lacks fnmatch---I know both alloca 
and getopt are absent. If you need a symbol table mapping sites to a related 
robots.txt then I have a splay tree implementation that should be fairly easy 
to adapt for that purpose.
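
To make the fnmatch idea concrete, here is a rough sketch of the check, 
assuming the exclusion entries from a site's robots.txt have already been 
fetched and parsed into an array of pattern strings. The helper name 
robots_path_allowed and the array layout are made up for illustration and 
are not existing lynx code; only fnmatch() itself is a real (POSIX) call.

```c
#include <fnmatch.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of lynx): returns 1 if `path` matches
 * none of the `n` exclusion patterns in `disallow`, 0 if some pattern
 * excludes it.  The patterns are assumed to come from an already
 * parsed robots.txt.
 */
static int robots_path_allowed(const char *path,
                               const char *const *disallow, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* flags == 0: '*' is allowed to match '/' as well, so a
         * pattern like "/private/*" covers the whole subtree. */
        if (fnmatch(disallow[i], path, 0) == 0)
            return 0;   /* excluded by this pattern */
    }
    return 1;           /* no pattern matched */
}
```

With the patterns "/cgi-bin/*" and "/private/*", a call for 
"/cgi-bin/search" would be refused and one for "/index.html" would be 
allowed. The non-interactive code would make this check before fetching 
each URL from a given host.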

We can steal an implementation of fnmatch from glibc, since lynx is 
distributed under the GPL.

Duncan (-:
"software industry, the: unique industry where selling substandard goods is
legal and you can charge extra for fixing the problems."

