From: Tobias Geerinckx-Rice
Subject: bug#52338: Crawler bots are downloading substitutes
Date: Fri, 10 Dec 2021 23:52:51 +0100
All,

Mark H Weaver wrote:
> For what it's worth: during the years that I administered Hydra, I found that many bots disregarded the robots.txt file that was in place there. In practice, I found that I needed to periodically scan the access logs for bots and forcefully block their requests in order to keep Hydra from becoming overloaded with expensive queries from bots.
Very good point. IME (which is a few years old at this point) at least the highlighted BingBot & SemrushThing always respected my robots.txt, but it's definitely a concern. I'll leave this bug open to remind us of that in a few weeks or so…
If it does become a problem, we (I) might add some basic User-Agent sniffing to either slow down or outright block non-Guile downloaders, whitelisting any legitimate ones, of course. I think that's less hassle than dealing with dynamic IP blocks whilst being equally effective here.
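For the record, a minimal sketch of what such User-Agent sniffing could look like, assuming nginx fronts the substitute server (the "GNU Guile" User-Agent prefix and the /nar/ path are assumptions to verify against real access logs before deploying anything):

```nginx
# Sketch only: rate-limit requests from non-Guile User-Agents while
# exempting Guix/Guile clients.  An empty key opts a request out of the
# limit_req zone entirely.
map $http_user_agent $throttle_key {
    default          $binary_remote_addr;  # unknown agents: limited per IP
    "~^GNU Guile"    "";                   # assumed Guix client UA: exempt
}

# 10 requests per minute per IP for anything not whitelisted above.
limit_req_zone $throttle_key zone=bots:10m rate=10r/m;

server {
    listen 80;

    location /nar/ {                # assumed substitute download path
        limit_req zone=bots burst=5 nodelay;
        # ... proxy_pass to the substitute server as usual ...
    }
}
```

The nice property of keying the zone on a map is that blocking or slowing a newly spotted bot is a one-line change to the map, rather than a growing pile of IP deny rules.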
Thanks (again) for taking care of Hydra, Mark, and thank you Leo for keeping an eye on Cuirass :-)
T G-R