
Re: [Savannah-hackers-public] Web Crawler Bots


From: Bob Proulx
Subject: Re: [Savannah-hackers-public] Web Crawler Bots
Date: Sat, 7 Jan 2017 17:54:30 -0700
User-agent: NeoMutt/20161126 (1.7.1)

Karl Berry wrote:
> Bob - as you probably know, there are some existing fail2ban filters for
> this -- {apache,nginx}-botsearch.conf are the most apropos I see at
> first glance. fail2ban is the only scalable/maintainable way I can
> imagine to deal with it.

I like fail2ban quite a bit too.  It is a really good tool.  But I
know you have been tuning your system with custom rules much more
than I have, and you have passed me in expertise at writing those
rules. :-)
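
For reference, turning those filters on is mostly a matter of a short
jail.local stanza.  A minimal sketch of the sort of thing I mean is
below; the log path and the times are just examples for illustration,
not something we actually run:

# /etc/fail2ban/jail.local -- sketch only; path and times are examples
[apache-botsearch]
enabled  = true
port     = http,https
logpath  = /var/log/apache2/error.log
maxretry = 2
findtime = 600
bantime  = 3600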

Mostly I was looking for comments or feedback on whether DROPping
packets from crawlers is a good or bad thing to do.  It seems okay to
me, especially if they are errant.  In this case I think the main
problem has been the lack of appropriate robots.txt files.  We had
them, but they were shadowed by other redirection rules which
prevented their public visibility.
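
What I have in mind is nothing fancier than a plain packet filter
rule, something along these lines (the address here is only a
documentation placeholder, not a real crawler):

# Drop all traffic from an errant crawler; 192.0.2.10 is a placeholder
iptables -I INPUT -s 192.0.2.10 -j DROP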

First I am going to get the robots.txt files visible and then wait a
bit for the crawlers to observe them.  It isn't fair to blame them if
they haven't been told their limits yet.  For the cvs and svn sites
where a robots.txt file just became visible I already see a
noticeable drop-off in crawler traffic.  It looks like that has been
enough for those.  I just need to work through the entire collection.
(And I am finding some errors in unrelated things that need to be
fixed along the way.)
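
One way to un-shadow the file is to exempt robots.txt before any of
the other rewrite rules get a chance to run.  Roughly this, as a
sketch; the real rules differ per vhost:

# Serve robots.txt as-is, ahead of the catch-all redirection rules
RewriteEngine On
RewriteRule ^/?robots\.txt$ - [L]
# ... existing redirection rules follow here ...

After that a quick curl of each site's /robots.txt from outside is
enough to confirm it is actually reachable.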

> A nonscalable/nonmaintainable way ... for tug.org, years ago I created a
> robots.txt based on spammer user-agent strings I found at
> projecthoneypot.org
> (https://www.projecthoneypot.org/harvester_useragents.php nowadays, it
> seems). It's still somewhat beneficial, though naturally it was surely
> out of date the instant I put it up, let alone now. I also threw in
> iptable rules by hand when the server was getting bogged down. I hope
> one day I'll set up fail2ban (including recidive) for it ... -k

That project looks pretty interesting.  I browsed around it and am
bookmarking it for further study.  But I am pretty sure I can
generate a good list from our own current logs.
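
Something as simple as tallying the agent strings in the access logs
gets most of the way there.  A rough one-liner, assuming the standard
combined log format and that log location:

# Most frequent user agents in a combined-format access log
awk -F'"' '{print $6}' /var/log/apache2/access.log \
  | sort | uniq -c | sort -rn | head -20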

For our purposes we don't need a lot of fine-grained control.  Every
one of the robots.txt files on the vcs web, which is all dynamically
generated and almost all endless versions of projects, is this:

User-agent: *
Disallow: /

Each of those sites is an individual site.  If I find something that
is not dynamically generated content then it can be allowed, as
sketched below.  But otherwise it needs to be hands-off completely.
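
If such a directory does turn up, the non-standard Allow extension
(which the major crawlers honor) can open it up while keeping the
blanket Disallow.  Something like this, with a made-up path:

User-agent: *
Allow: /static-docs/
Disallow: /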

Bob


