Weblog entry #17 for lee

And it absolutely will not stop...
Posted by lee on Wed 26 Apr 2006 at 12:25

$ cat robots.txt

User-agent: msnbot/1.0
Disallow: /subdirectory/

Dear msnbot, to quote RFC 2616:

10.4.11 410 Gone
The requested resource is no longer available at the server and no forwarding address is known. This condition is expected to be considered permanent.

To me, this means: don't keep re-checking the same hundreds of pages marked as "Gone", every day, for weeks on end.
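For what it's worth, here is a minimal sketch of how a crawler could honour that, assuming it keeps some kind of queue of URLs to revisit. The `Frontier` class and the example path are hypothetical; this is not how msnbot or any real crawler is known to work.

```python
from dataclasses import dataclass, field

GONE = 410  # RFC 2616, section 10.4.11


@dataclass
class Frontier:
    """Hypothetical crawl queue; the names here are illustrative only."""
    pending: set = field(default_factory=set)
    gone: set = field(default_factory=set)

    def add(self, url):
        # Never re-queue a URL the server has declared permanently gone.
        if url not in self.gone:
            self.pending.add(url)

    def record(self, url, status):
        # The URL has been fetched, so take it off the queue;
        # remember a 410 so that later add() calls are ignored.
        self.pending.discard(url)
        if status == GONE:
            self.gone.add(url)


frontier = Frontier()
frontier.add("/subdirectory/old-page")          # hypothetical path
frontier.record("/subdirectory/old-page", 410)  # server said "Gone"
frontier.add("/subdirectory/old-page")          # ignored from now on
```

The point is only that a 410 is remembered, so the same page is not fetched again tomorrow.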

 

Comments on this Entry

Posted by Steve (212.20.xx.xx) on Wed 26 Apr 2006 at 13:45

I have spiders constantly trying to mirror my website(s) and mass-downloading pages, frequently ignoring the /robots.txt file.

I wouldn't mind if they didn't make lots of requests with no pauses between them - and didn't make utterly broken requests.

Here is my favourite malformed request for today:

GET /articles/388#comment_1 HTTP/1.1
..
GET /articles/388#comment_50 HTTP/1.1
..
GET /articles/388#comment_60 HTTP/1.1

The # character for a named anchor should never be sent in an HTTP request: the fragment is handled by the client, not the server. And people wonder why I ban IP addresses ..
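A well-behaved client would strip the fragment before building the request line. A minimal sketch using Python's standard `urllib.parse.urldefrag`; the `request_target` helper is a made-up name for illustration:

```python
from urllib.parse import urldefrag


def request_target(link):
    """Return the part of a link that belongs in the HTTP request line."""
    # urldefrag splits off everything from '#' onwards.
    url, _fragment = urldefrag(link)
    return url


links = [
    "/articles/388#comment_1",
    "/articles/388#comment_50",
    "/articles/388#comment_60",
]
# All three links point at the same resource, so one request is enough.
targets = {request_target(link) for link in links}
```

With the fragments removed, the three requests above collapse into a single GET for /articles/388.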

Steve


Posted by Anonymous (213.164.xx.xx) on Wed 26 Apr 2006 at 14:09
Ugh.

Why not add
Crawl-delay: 10
to your robots.txt file?

Some robots support it..
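For reference, this is roughly how a polite robot could read that setting, sketched with Python's standard `urllib.robotparser`. Crawl-delay is a non-standard extension, the robots.txt content below is an assumed example (it uses `User-agent: *` rather than the msnbot entry from the post), and "examplebot" is a made-up user agent.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /subdirectory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Seconds the robot is asked to wait between requests (None if unset).
delay = parser.crawl_delay("examplebot")

# Whether the robot may fetch a given path at all.
allowed = parser.can_fetch("examplebot", "/subdirectory/page")
```

A robot that honours this would sleep for `delay` seconds between fetches and skip anything for which `can_fetch` returns False.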


Posted by Steve (212.20.xx.xx) on Wed 26 Apr 2006 at 15:08

I suspect that robots which:

  • Make bogus requests.
  • Request things expressly forbidden in the existing robots.txt file.
  • Make multiple requests per second.

are probably not going to honour any setting I add, to be honest..
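Spotting the "multiple requests per second" offenders is easy enough to sketch. The following is an illustration only, not what this site actually runs: `flag_fast_clients`, the threshold of 2 and the sample hits (using documentation-range IP addresses) are all assumptions.

```python
from collections import defaultdict, deque


def flag_fast_clients(hits, max_per_second=2):
    """Return the IPs that exceed max_per_second within any one second.

    hits: iterable of (timestamp_in_seconds, ip), sorted by timestamp.
    """
    recent = defaultdict(deque)  # ip -> timestamps within the last second
    flagged = set()
    for ts, ip in hits:
        window = recent[ip]
        window.append(ts)
        # Drop hits that are a full second or more older than this one.
        while ts - window[0] >= 1.0:
            window.popleft()
        if len(window) > max_per_second:
            flagged.add(ip)
    return flagged


# Made-up sample data: one client hammering, one behaving.
sample = [
    (0.0, "192.0.2.1"),
    (0.1, "192.0.2.1"),
    (0.2, "192.0.2.1"),
    (0.5, "192.0.2.7"),
    (5.0, "192.0.2.7"),
]
offenders = flag_fast_clients(sample)
```

Here only 192.0.2.1 ends up in `offenders`, since it made three requests inside one second.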

Steve

