[Savannah-hackers-public] Web Crawler Bots
From: Bob Proulx
Subject: [Savannah-hackers-public] Web Crawler Bots
Date: Fri, 6 Jan 2017 15:00:34 -0700
User-agent: NeoMutt/20161126 (1.7.1)
Just a general observation to throw out to the group...
I have been spending some time looking at the web logs. It looks to
me like a lot of web crawler bots are ignoring the robots.txt file.
I am seeing a lot of bots crawling the /viewvcs interface for both
svn and cvs, and often the gitweb interface too.
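For anyone who wants to look at the same thing, here is a quick
sketch of the sort of tally I have been doing, in Python. The log
path, the URL prefixes, and the assumption of Apache combined-format
logs are placeholders to adjust, not a description of the actual
setup:

    #!/usr/bin/env python3
    # Quick tally of which User-Agents are hitting the repository
    # browsing pages.  Log location and URL prefixes are assumptions;
    # adjust to match the real Apache (combined format) logs.
    import collections
    import re

    LOG = "/var/log/apache2/access.log"                    # assumed path
    PREFIXES = ("/viewvc/", "/cgi-bin/viewvc", "/gitweb")  # assumed prefixes

    # combined format: host ident user [date] "request" status bytes "referer" "agent"
    line_re = re.compile(
        r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    counts = collections.Counter()
    with open(LOG, errors="replace") as logfile:
        for line in logfile:
            match = line_re.match(line)
            if match and match.group(1).startswith(PREFIXES):
                counts[match.group(2)] += 1     # tally per User-Agent

    for agent, hits in counts.most_common(20):
        print(f"{hits:8d}  {agent}")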
On the new servers this isn't causing an unacceptable load. It is
nice having fast, capable systems now. But it is continuous,
undesirable activity. That is a huge tree of dynamically created web
pages, following every possible version and branch in hundreds of
repositories. Not infinite by any means, but definitely dynamic,
large, and always growing.
I am contemplating what should be done to deal with them. Initially I
would write up fail2ban rules based upon known bot User-Agent
signatures and then rate limit their hits on those paths. That would
allow limited use by any robot in small quantities, such as when a URL
is mentioned and an IRC bot retrieves its title or summary. But it
would block bots endlessly crawling the trees.
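To make that concrete, here is roughly the shape of what I have in
mind. This is only a sketch: the filter name, the bot list, the log
path, and the assumption of Apache combined-format logs with ViewVC
under /viewvc/ are all placeholders, not anything already deployed.

    # /etc/fail2ban/filter.d/viewvc-bots.conf  (hypothetical filter name)
    [Definition]
    # Substrings of self-identified crawler User-Agents; extend from the logs.
    botnames = SemrushBot|AhrefsBot|MJ12bot|DotBot
    # Match GET/HEAD requests under /viewvc/ whose User-Agent contains
    # one of the names above (Apache "combined" log format assumed).
    failregex = ^<HOST> -.*"(GET|HEAD) /viewvc/[^"]*" \d+ \S+ "[^"]*" "[^"]*(?:%(botnames)s)[^"]*"$
    ignoreregex =

    # /etc/fail2ban/jail.local -- rate limit rather than ban outright:
    # an IP only gets banned (for an hour) after more than 20 matching
    # requests within 60 seconds, so a one-off fetch such as an IRC bot
    # grabbing a page title is unaffected.
    [viewvc-bots]
    enabled  = true
    port     = http,https
    filter   = viewvc-bots
    logpath  = /var/log/apache2/*access.log
    maxretry = 20
    findtime = 60
    bantime  = 3600

A second filter and jail along the same lines would cover the gitweb
paths.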
Any better ideas?
UPDATE: I figured out that the cvs.sv.gnu.org/robots.txt file wasn't
available in our current configuration. It was shadowed. That
probably accounts for a lot of the robots. Fixed that. So I am not
doing anything else for the moment. I will wait a day and check
whether having the cvs site's robots.txt file available makes the
difference (it probably accounts for a lot of it), and then decide. I
think a fail2ban rule is still a good idea in that case. But it only
catches robots that say they are a robot, making it a fairly blunt
instrument. If people have experience with smarter ways of managing
that, I would still be interested in learning about it.
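For reference, the file itself only needs to be the usual crawler
exclusion for those trees. The paths below are illustrative, not a
copy of what cvs.sv.gnu.org actually serves:

    # Illustrative robots.txt -- example paths, not the actual Savannah file.
    User-agent: *
    Disallow: /viewvc/
    Disallow: /cgi-bin/viewvc/

Worth remembering that robots.txt is strictly per host, so each of the
vcs web hosts needs to serve its own copy at the site root.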
Bob