About the Robot

I’m conducting a little survey of the web to see what interesting things people are doing on their sites.

The user-agent string of the robot is:

Mozilla/5.0 (compatible; nextthing.org/1.0; +http://www.nextthing.org/bot)

I’ll be blogging here about anything interesting I find, so check back if you want, or subscribe to my RSS feed.

If you have any questions or comments about the robot, please send an e-mail to “robot” at this domain, or leave a comment below. Thanks!


5 Responses to “About the Robot”

  1. webmaster Says:

    How about your robot respecting the robot rules, rather than ignoring the robots.txt file?

    Your visit:

    adsl-68-126-233-177.dsl.pltn13.pacbell.net - - [27/Jul/2005:04:12:01 -0400] "GET / HTTP/1.1" 200 9579 "-" "Mozilla/5.0 (compatible; nextthing.org/1.0; +http://www.nextthing.org/bot)"

  2. Andrew Says:

    In this first pass with the robot, I’m investigating the usage of HTTP headers in responses to GET requests. I’m looking for interesting things in those headers, so it’s important that I grab the index page of each site: those pages are much more likely to be dynamically generated, and therefore to emit custom or malformed headers. I’m only retrieving the index page at the root of the site, so the load on the server should be insignificant, and I shouldn’t be finding anything the webmaster wants to hide from normal users on the web (all of the domains I’m looking at are in dmoz).

    In my next pass, I’ll be grabbing robots.txt files and (hopefully) obeying them as I spider more pages.

  3. GaryK Says:

    Ugh. Another research-type bot that thinks it doesn’t need to read and respect robots.txt. It’s nice that you’ll be adding support for this. Until you do, your bot will be banned from my sites.

  4. schorsch Says:

    The robots.txt file exists so that a robot can adhere to the wishes of the respective webmaster.
    Having different goals as a robot operator doesn’t justify violating the rules.

    I expected more respect for the standards established by the web community from a blogger of your caliber.

  5. Andrew Says:

    My first pass is done. Just to make sure there’s no confusion here, what I did was give my program a list of URLs to look at. It retrieved those URLs, and only those, and stored the results in a big database. No links were followed (other than Location: redirects, and then only up to 5 times), and no URLs other than those I gave it were retrieved.

    I’m familiar with the Robots Exclusion Protocol, and as I read it, what I did in the first pass doesn’t fall under what it was intended for. Its most recent document, the RFC draft, defines what it considers to be web robots:

    Web Robots (also called “Wanderers” or “Spiders”) are Web client programs that automatically traverse the Web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

    I just want to be clear that in this first pass there was no “recursive retrieval” going on, as no links were followed.

    That said, I should have made retrieving robots.txt my first pass, and then gone on to retrieve my set of URLs while respecting robots.txt. Consider me duly chastised. In any future runs, the bot will respect robots.txt.

    Thanks for the feedback.
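
For readers curious what the behaviour discussed in the thread might look like in code, here is a minimal Python sketch. It is not the robot's actual implementation. It assumes the standard library's urllib.robotparser, a short user-agent token of "nextthing.org" for matching robots.txt rules, and a hypothetical fetch callable that returns a (status, headers) pair for a URL. It shows two things: checking a URL against a robots.txt body, and following Location: redirects no more than 5 times.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Assumption: robots.txt rules are matched against this short token,
# not against the full "Mozilla/5.0 (compatible; ...)" string.
ROBOT_TOKEN = "nextthing.org"
MAX_REDIRECTS = 5  # the cap mentioned in the thread

REDIRECT_CODES = (301, 302, 303, 307, 308)


def allowed_by_robots(robots_txt, url, agent=ROBOT_TOKEN):
    """Return True if the given robots.txt body lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def follow_redirects(start_url, fetch, max_redirects=MAX_REDIRECTS):
    """Follow Location: redirects at most `max_redirects` times.

    `fetch` is a hypothetical callable: fetch(url) -> (status, headers),
    where headers is a dict. No links in page bodies are followed.
    """
    url = start_url
    for _ in range(max_redirects + 1):
        status, headers = fetch(url)
        location = headers.get("Location")
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return url, status, headers
    raise RuntimeError("too many redirects starting from " + start_url)
```

In a real run, the robots.txt body would be downloaded once per host before anything else, and any URL for which allowed_by_robots returns False would be skipped rather than fetched.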
