About the Robot
I’m conducting a little survey of the web to see what interesting things people are doing on their sites.
The user-agent string of the robot is:
Mozilla/5.0 (compatible; nextthing.org/1.0; +http://www.nextthing.org/bot)
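For the curious, here's roughly how a request identifying itself that way can be made with Python's standard library. This is a minimal sketch, not the robot's actual code; the function names are my own.

```python
import urllib.request

UA = "Mozilla/5.0 (compatible; nextthing.org/1.0; +http://www.nextthing.org/bot)"

def make_request(url):
    """Build a request that identifies itself with the bot's user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": UA})

def fetch_headers(url, timeout=10):
    """Fetch a URL and return its status code and response headers."""
    with urllib.request.urlopen(make_request(url), timeout=timeout) as resp:
        return resp.status, dict(resp.getheaders())
```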
I’ll be blogging here about anything interesting I find, so check back if you want, or subscribe to my RSS feed.
If you have any questions or comments about the robot, please send an e-mail to “robot” at this domain, or leave a comment below. Thanks!
Posts:
Fun with HTTP Headers
July 28th, 2005 at 3:30 AM
How about your robot respecting the robot rules, rather than ignoring the robots.txt file?
Your visit:
adsl-68-126-233-177.dsl.pltn13.pacbell.net – – [27/Jul/2005:04:12:01 -0400] “GET / HTTP/1.1” 200 9579 “-” “Mozilla/5.0 (compatible; nextthing.org/1.0; +http://www.nextthing.org/bot)”
July 30th, 2005 at 9:45 AM
In this first pass with the robot, I’m investigating the usage of HTTP headers in GET requests. I’m looking for interesting things in the HTTP headers, so it’s important that I grab the index page of the site, as those pages are much more likely to be dynamically generated, and therefore emit custom or malformed headers. I’m only retrieving the index page at the root of the site, so the load on the server should be insignificant, and I shouldn’t be finding anything the webmaster wants to hide from normal users on the web (all of the domains I’m looking at are in dmoz).
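To give a flavor of the header triage described above: once a response's headers are in hand, picking out the non-standard ones is simple. The "common" set below is an illustrative list of my own, not the survey's actual criteria.

```python
# Headers so common they're not interesting for the survey (illustrative list).
COMMON = {"date", "server", "content-type", "content-length", "connection",
          "last-modified", "etag", "cache-control", "expires",
          "transfer-encoding"}

def unusual_headers(headers):
    """Return (name, value) pairs for headers outside the common set."""
    return [(k, v) for k, v in headers if k.lower() not in COMMON]
```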
In my next pass, I’ll be grabbing robots.txt files and (hopefully) obeying them as I spider more pages.
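For what it's worth, Python's standard library already has a robots.txt parser, so the obeying part is straightforward. A minimal sketch (the bot's real implementation may differ, and the short user-agent token here is my assumption):

```python
import urllib.robotparser

BOT_NAME = "nextthing.org"  # assumed token the bot would match against

def allowed(robots_txt_lines, url):
    """Check whether robots.txt permits this bot to fetch the given URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp.can_fetch(BOT_NAME, url)
```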
July 31st, 2005 at 10:41 AM
Ugh. Another research-type bot that thinks it doesn’t need to read and respect robots.txt. It’s nice that you’ll be adding support for this. Until you do, your bot will be banned from my sites.
August 1st, 2005 at 7:28 AM
The robots.txt file exists so that a robot can adhere to the wishes of the respective webmaster.
Just because you, as a robot operator, have different goals doesn’t justify violating the rules.
I expected more respect for the standards established by the web community from a blogger of your standing.
August 2nd, 2005 at 6:38 PM
My first pass is done. Just to make sure there’s no confusion here, what I did was give my program a list of URLs to look at. It retrieved those, and only those, URLs and stuck them in a big database. No links were followed (other than Location: redirects, and then only up to 5 times), and no URLs other than those I gave it were retrieved.
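Capping Location: redirects at 5 can be done by nudging urllib's redirect handler, which follows up to 10 by default. A sketch of the idea, not necessarily how the bot does it:

```python
import urllib.request

class FiveRedirects(urllib.request.HTTPRedirectHandler):
    # urllib's default cap is 10; the bot stops after 5.
    max_redirections = 5

# An opener built with this handler gives up after the fifth redirect.
opener = urllib.request.build_opener(FiveRedirects)
```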
I’m familiar with the Robots Exclusion Protocol, and as I read it, what I did in the first pass doesn’t fall under what it was intended for. Their most recent document, the RFC draft, defines what they consider to be a web robot.
I just want to be clear that in this first pass there was no “recursive retrieval” going on, as no links were followed.
That said, I should have made retrieving robots.txt my first pass, and then gone on to retrieve my set of URLs while respecting robots.txt. Consider me duly chastised. In any future runs, the bot will respect robots.txt.
Thanks for the feedback.