Weblog entry #17 for lee
$ cat robots.txt
User-agent: msnbot/1.0
Disallow: /subdirectory/
10.4.11 410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent.
To me, this means a crawler should not keep re-checking the same hundreds of pages marked as "Gone", every day, for weeks on end.
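As a rough illustration of what honouring a 410 could look like on the crawler side, here is a minimal Python sketch. The function name `prune_gone`, the `observed` table, and the URLs in it are all made up for the example; a real crawler would get the status codes from its own fetches.

```python
from http import HTTPStatus

def prune_gone(queue, status_for):
    """Return the URLs still worth revisiting, dropping any that answered 410 Gone."""
    keep = []
    for url in queue:
        if status_for(url) == HTTPStatus.GONE:
            continue  # 410 is meant to be permanent, so never schedule it again
        keep.append(url)
    return keep

# Hypothetical status codes seen on the last crawl (not real log data).
observed = {"/old/page1": 410, "/articles/388": 200, "/moved": 404}

remaining = prune_gone(list(observed), observed.get)
print(remaining)  # ['/articles/388', '/moved']
```

The 404 is kept on purpose: unlike 410, it says nothing about whether the condition is permanent.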
Comments on this Entry
[ Send Message | View Steve's Scratchpad | View Weblogs ]
I have spiders constantly trying to mirror my website(s) and mass-downloading pages, frequently ignoring the /robots.txt file.
I wouldn't mind if they didn't make lots of requests with no pauses between them - and didn't make utterly broken requests.
Here is my favourite malformed request for today:
GET /articles/388#comment_1 HTTP/1.1
..
GET /articles/388#comment_50 HTTP/1.1
..
GET /articles/388#comment_60 HTTP/1.1
The # character for a named anchor should never be sent in an HTTP request; the anchor is resolved by the client after the page is fetched. And people wonder why I ban IP addresses ..
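For what it's worth, a well-behaved client only has to drop the fragment before building the request line. A minimal Python sketch using the standard library's `urllib.parse.urldefrag`; the helper name `request_target` is just for illustration:

```python
from urllib.parse import urldefrag

def request_target(url):
    """Return the URL with any #fragment removed, as it should appear in the request."""
    stripped, _fragment = urldefrag(url)
    return stripped

print(request_target("/articles/388#comment_1"))   # /articles/388
print(request_target("/articles/388#comment_50"))  # /articles/388
```

All three of the requests above collapse to the same target, so a correct spider would have fetched the page once.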
[ Parent | Reply to this comment ]
Why not add
Crawl-delay: 10
to your robots.txt file?
Some robots support it..
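On the robot side, reading that setting can be sketched with Python's standard `urllib.robotparser`. The robots.txt text and the agent name `examplebot` below are assumptions made up for the example, not anyone's real configuration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: examplebot
Crawl-delay: 10
Disallow: /subdirectory/
"""

def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)
delay = rules.crawl_delay("examplebot")  # seconds to wait between requests, or None
allowed = rules.can_fetch("examplebot", "/subdirectory/page.html")

print(delay, allowed)
```

A polite robot would sleep for `delay` seconds between requests and skip any URL for which `can_fetch` returns False.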
[ Send Message | View Steve's Scratchpad | View Weblogs ]
I suspect that robots which:
- Make bogus requests.
- Request things expressly forbidden in the existing robots.txt file.
- Make multiple requests per second.
are probably not going to honour any setting I add, to be honest..