Re: lynx-dev Non-interactive lynx
From: Duncan Simpson
Subject: Re: lynx-dev Non-interactive lynx
Date: Sun, 18 Mar 2001 04:16:01 +0000
> In "lynx-dev Non-interactive lynx"
> I think it would still provoke those who spend time and consideration
> on which of their files have;
> <META NAME="robots" CONTENT="all/none/nofollow/noindex">
> and so forth. Also bear in mind that no robot can read copyright
> notices in the body of a page.
>
Does wget notice this? The robot exclusion protocol I know about is different:
/robots.txt contains a set of glob patterns covering the pages that robots
should not read. I am pretty sure this is what wget knows about. If a site
uses the META tags you suggest then it almost certainly provides robots.txt
as well.
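
To make the file format concrete, here is a minimal sketch of pulling the
pattern out of one "Disallow:" line of a robots.txt. The helper name
parse_disallow and its buffer interface are hypothetical and not taken from
lynx or wget; the sketch ignores "User-agent:" grouping entirely, which a real
robot would have to honour.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not lynx or wget code): if `line` is a "Disallow:"
 * record, copy its value into `out` (at most outsz - 1 bytes) and return 1;
 * otherwise return 0 and leave `out` untouched. */
static int parse_disallow(const char *line, char *out, size_t outsz)
{
    static const char field[] = "disallow:";
    size_t i, n;

    assert(line != NULL && out != NULL);
    if (outsz == 0)
        return 0;

    /* Field names are compared case-insensitively. A line shorter than
     * the field name fails at its terminating NUL, so we never read past
     * the end of the string. */
    for (i = 0; field[i] != '\0'; i++) {
        if (tolower((unsigned char)line[i]) != field[i])
            return 0;
    }
    line += i;

    /* Skip blanks between the colon and the value. */
    while (*line == ' ' || *line == '\t')
        line++;

    /* The value ends at whitespace, a '#' comment, or end of line. */
    n = strcspn(line, " \t\r\n#");
    if (n >= outsz)
        n = outsz - 1;  /* assumption: silently truncate overlong values */
    memcpy(out, line, n);
    out[n] = '\0';
    return 1;
}
```

Feeding it the line "Disallow: /cgi-bin/ # scripts" should leave "/cgi-bin/"
in the output buffer; an empty value (a bare "Disallow:") comes back as an
empty string, which conventionally means nothing is excluded.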
> Just wondered: how easy/hard would it be to make Lynx obey robot
> exclusion protocols in non-interactive mode? This is also done
> with HTTP headers?
>
Obeying robots.txt should be eminently possible given a C library with a
working version of fnmatch in it. It might make sense to provide an
implementation of fnmatch for those with a deficient C library; for example,
the C library M$ Visual C++ provides probably lacks fnmatch---I know both
alloca and getopt are absent. If you need a symbol table mapping each site to
its robots.txt then I have a splay tree implementation that should be fairly
easy to adapt for that purpose.
We can steal an implementation of fnmatch from glibc, since lynx is
distributed under the GPL.
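
A rough sketch of the fnmatch check described above, assuming the exclusion
patterns have already been read into an array of strings. The function name
path_is_disallowed is hypothetical and this is not code from lynx or wget.
The original robots.txt convention describes Disallow values as path
prefixes rather than globs, so the sketch appends '*' to each pattern; that
way a plain prefix and a glob are both handled by the same POSIX fnmatch
call.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not lynx or wget code): return 1 if `path` matches
 * any of the `npatterns` exclusion patterns, 0 otherwise. */
static int path_is_disallowed(const char *path,
                              const char *const *patterns,
                              size_t npatterns)
{
    char buf[1024];
    size_t i;

    assert(path != NULL);
    for (i = 0; i < npatterns; i++) {
        /* An empty pattern conventionally excludes nothing. */
        if (patterns[i][0] == '\0')
            continue;

        /* Assumption: patterns too long for the buffer are ignored. */
        if (strlen(patterns[i]) + 2 > sizeof buf)
            continue;

        /* Append '*' so that a plain prefix such as "/cgi-bin/" also
         * matches everything beneath it. */
        snprintf(buf, sizeof buf, "%s*", patterns[i]);

        /* flags == 0: '*' may match '/' as well, so one pattern covers
         * a whole subtree. */
        if (fnmatch(buf, path, 0) == 0)
            return 1;
    }
    return 0;
}
```

With the patterns "/cgi-bin/" and "/private/*.html", this should report
"/cgi-bin/search" and "/private/notes.html" as disallowed and "/index.html"
as allowed. A non-interactive lynx would call something like it before each
fetch and skip the URL on a match.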
--
Duncan (-:
"software industry, the: unique industry where selling substandard goods is
legal and you can charge extra for fixing the problems."