Last October I got bored and set my spider loose on the robots.txt files of the world. Having had a good deal of positive feedback on my HTTP Headers survey, I had decided to poke around in robots.txt files and see what sorts of interesting things I could find.
To those of you completely unaware of what this post is about, here’s a brief primer. Google is a search engine. You probably use it. If not, odds are you use one of MSN Search (now called “Live Search”), Ask Jeeves (now Ask.com), or Yahoo! Search. How do those search engines grab web pages to search? Well, they use robots, also called spiders. Now, these aren’t the giant metal machines you see chasing tweaked out English factory workers through the streets of London, nor are they the giant eight-legged creatures you find lurking behind clocks. Rather, they’re pieces of software that surf around the web grabbing web pages. Since they’re software, they can surf the web much faster than humans, as well as find things most humans might overlook. As such, there arose a need for a standard for advising robots on what they should and shouldn’t look at.
The Robots Exclusion Protocol arose in June 1994 by consensus among a number of web spider developers. The original protocol description from 1994 describes the basic syntax of a robots.txt file to be placed at the root of a web site. So, for example, Google would place their robots.txt file at:
The basic format goes something like this. First, the file specifies a User-agent (the name of the robot) that is to follow the subsequent rules (until the next User-agent line):
This line tells “SuperHappyRobot” that it needs to pay attention to the next few lines. Any other robot will ignore these rules. The next line might look something like:
Which would mean SuperHappyRobot shouldn’t download any pages that start with the path “/tmp/” from this server. Variations on these lines are that * will match any robot name (in other words, “User-agent: *” should tell all the robots to pay attention), and blank Disallow statements mean anything goes. So, Apple’s robots.txt file of:
# robots.txt for http://www.apple.com/
User-agent: *
Disallow:
means, essentially, that any robot is free to grab any page it can get its hands on, at least for the “www.apple.com” website.
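To make the basic rules concrete, here is a minimal sketch using Python's standard urllib.robotparser module. It is not the spider used for this survey; the example.com URLs and the robot name OtherRobot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# A record naming one robot, as in the SuperHappyRobot example.
single = build_parser("User-agent: SuperHappyRobot\nDisallow: /tmp/\n")
print(single.can_fetch("SuperHappyRobot", "http://example.com/tmp/cache.html"))  # False
print(single.can_fetch("SuperHappyRobot", "http://example.com/index.html"))      # True
# A robot that is not named in the file ignores the record entirely.
print(single.can_fetch("OtherRobot", "http://example.com/tmp/cache.html"))       # True

# Apple's file: wildcard user-agent with an empty Disallow, so anything goes.
apple = build_parser(
    "# robots.txt for http://www.apple.com/\nUser-agent: *\nDisallow:\n"
)
print(apple.can_fetch("SuperHappyRobot", "http://www.apple.com/any/page.html"))  # True
```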
So, that was all well and good, but around 1996 there was a push to try to get robots.txt standardized, and an IETF draft (http://www.robotstxt.org/wc/norobots-rfc.html) was produced that clarified and added to the robots.txt syntax. The primary addition was a new “Allow” rule, which allowed a little more fine-grained control over which pages could be retrieved. For example, with the following set of rules:
User-agent: *
Disallow: /apache/
Allow: /apache/02/03/11/2228242.shtml
All documents except “/apache/02/03/11/2228242.shtml” in the “/apache/” path would be excluded from spidering. There was also a provision for “extensions” to the protocol, such that a rule line like “Crawl-delay: 10” could be added. Spiders that didn’t support that extension would ignore it, while spiders that did might delay 10 seconds between page fetches.
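Here is a rough sketch of how a spider might evaluate Allow and Disallow rules together. It assumes the most specific (longest) matching prefix wins, which is one common interpretation and not necessarily what every robot does. As I read the IETF draft, rules are instead tried in order and the first match is used, so a strictly draft-following robot would want the Allow line listed first. The function name is_allowed is my own.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path prefix), e.g. ("Disallow", "/apache/")

def is_allowed(rules: List[Rule], path: str) -> bool:
    """Return True if `path` may be fetched, using longest-prefix matching."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        # An empty prefix (blank Disallow) restricts nothing, so skip it.
        if prefix and path.startswith(prefix) and len(prefix) > best_len:
            best_len = len(prefix)
            allowed = directive.lower() == "allow"
    return allowed

rules = [
    ("Disallow", "/apache/"),
    ("Allow", "/apache/02/03/11/2228242.shtml"),
]
print(is_allowed(rules, "/apache/02/03/11/2228242.shtml"))  # True
print(is_allowed(rules, "/apache/index.html"))              # False
print(is_allowed(rules, "/index.html"))                     # True
```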
Around the same time the IETF draft was being discussed, Sean “Captain Napalm” Conner proposed his own extension to the Robots Exclusion Protocol, which included Allow rules as well as regular expression syntax for rules, and new Robot-version, Visit-time, Request-rate, and Comment rules. Fewer than 100 of the sites I visited use rules unique to this spec.
Since none of these three documents has ever been ratified or adopted by a standards body, there has been a bit of persistent confusion over what constitutes a valid robots.txt document. The most definitive document is certainly the original 1994 document. Most commercial robots today, however, attempt to conform to the IETF draft document. And, given the large number of Allow rules around, it would be remiss of a robot not to try.
A Touch of Controversy
This de facto standard has had its share of controversy over the years. Many webmasters object to having to opt out of spiders crawling their site. Given that I found 47,738 sites that disallow spidering the root of their site with the wildcard (*) user-agent match, it appears that this viewpoint still has many adherents, and many just want to be left alone by the bulk of spiders. See the comments in this thread for some examples of this opinion from some relatively tech-savvy webmasters. Among them is the well-known IncrediBILL:
Lack of a robots.txt file should mean just that, they don’t know about robots so robots should STAY THE HELL OUT!
I’ll come back to this later.
Others have objected to the idea of putting up a roadmap to secret pages on their sites. Bertrand Meyer, the designer of Eiffel (the programming language, not the Tower) and a Very Smart Person even holds this viewpoint. To quote:
If you are just a bit absent-minded, isn’t it natural
to use this mechanism to exclude stuff from being indexed and hence believe
no one will find it? “Stupid”, maybe — but not unlikely.
Indeed, scanning through the robots.txt files I pulled down, I find disallow rules for 3,000+ “phpMyAdmin” paths, 40,000+ “stats” paths, 31,000+ “log” paths, 400+ “secret” paths, 100,000+ “admin” paths, and a host of other interesting looking entries. Even if the vast majority of these are properly secured with authentication, the chances of a few people being absent-minded, as Bertrand might say, are pretty good.
On the flip side of these opinions, there are those who have always viewed, and want to continue to view, robots.txt as a merely advisory standard. As courts and legislative bodies have begun to apply the force of law to this loose consensus protocol, some have spoken out in favor of information transparency and the essential openness of the Internet, including Martijn Koster, the creator of the protocol:
“I don’t think that’s in the spirit of free information exchange,” Koster says. Some robots may have legitimate reasons to ignore robot exclusion directives. For example, he says, a company might use robots to hunt for copyright infringing content.
Having written a spider for my HTTP headers survey and run it against all of the domains in the Open Directory, I already had a large collection of web sites, and a decent spider. I further added to my list of domains by extracting links from the pages I’d downloaded for that project. Then, I ran my spider (written in Python, using PycURL) against this expanded list of domains, attempting to retrieve the robots.txt file at each. The HTTP headers and full body of the response were stored in a MySQL database. This database was then dumped via a custom “Big File” implementation, which amounted to a bit more than 12GB on disk. Then, I wrote an analyzer which could run through this logical file, processing the records, recording interesting statistics about the entries and reporting the results. This analyzer takes about half an hour to run on the dataset. In total, I received responses from about 4.6 million unique domains.
HTTP status codes (aka response codes) tell web browsers and robots both what kind of response they’re getting when they download a page. For example, “200” means everything is okay and “404” means the web server couldn’t find the file the browser requested. The IETF robots.txt spec says that a 404 response for robots.txt means the site is unrestricted for robots, and a 2XX response means the robot must respect the returned robots.txt content. Other status codes have recommended behaviors, but they’re not required.
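As a small sketch of that decision, a spider might map the status code of the robots.txt fetch to a policy like this. The function name and the three policy labels are hypothetical, and the catch-all branch only reflects that the draft recommends, without requiring, behavior for other codes.

```python
def robots_policy(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy label."""
    if 200 <= status <= 299:
        return "obey-file"      # parse the body and respect its rules
    if status == 404:
        return "unrestricted"   # no robots.txt, so nothing is excluded
    return "robot-decides"      # only recommended behavior exists for these

for code in (200, 404, 301, 503):
    print(code, robots_policy(code))
```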
Status codes are interesting primarily because they give a quick count of how many sites have a robots.txt file. I got responses from 4.6 million sites, so by tallying the response codes of different types, I can tell who has a robots.txt file and who doesn’t:
Broken down by class, we get:
| Class | Count | % of Total |
As we can see above, around 65% of sites return a 4XX status code, indicating they don’t have a robots.txt file. Another 7.6% redirect to a different URL, usually either the home page or an error page. This means, essentially, that about 26% of sites are attempting to serve up a valid robots.txt file. Of course, some sites may improperly return an error page with a 2XX status code, so this is only useful as a quick estimate.
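Tallying by class is simple enough to sketch in a few lines. The status codes in sample below are made up purely to show the mechanics; they are not the survey data.

```python
from collections import Counter

def status_class(status: int) -> str:
    """Collapse a status code into its class, e.g. 404 -> '4XX'."""
    return f"{status // 100}XX"

# Hypothetical sample, NOT the survey results.
sample = [200, 404, 404, 301, 403, 200, 500, 404]

counts = Counter(status_class(code) for code in sample)
total = sum(counts.values())
for cls, count in sorted(counts.items()):
    print(f"{cls}\t{count}\t{100 * count / total:.1f}%")
```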
MIME types (aka content types) are returned in the headers of HTTP responses by web servers to tell clients what the document’s type is. They consist of a type (text, image, etc), a subtype (like html or jpeg) and some other optional parameters (like the character encoding). So, for example, an HTML file usually has a MIME type like “text/html” and a text file a type like “text/plain”. An image file might have a MIME type like “image/gif” or “image/jpeg”. The IANA keeps an official list of registered MIME types at http://www.iana.org/assignments/media-types/.
The only MIME type that should be returned for a valid robots.txt file is text/plain. True, the specs don’t specifically mention MIME types, but sites like Google follow the general HTTP rule of “if it’s not text/*, it’s not really plain text”. Of the robots.txt files I got back, 109,780 of them had MIME types other than text/plain. So, it should be no surprise that the big three search engines (Yahoo!, Google, and MSN) will all attempt to parse any text robots.txt file they get back from the server. For example, Digg.com serves up their robots.txt file as “text/html; charset=UTF-8”. Google, MSN, and Yahoo! all obey the rules in the file.
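For illustration, here is one way to pull the MIME type and charset out of a Content-Type header with Python's standard library and apply the lenient “any text/* is fine” test. This is a sketch, not how any search engine actually does it, and the helper names are my own.

```python
from email.message import Message
from typing import Optional, Tuple

def parse_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Return (mime type, charset) from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() lowercases the type and falls back to text/plain
    # when the value is malformed; get_content_charset() may return None.
    return msg.get_content_type(), msg.get_content_charset()

def looks_like_text(header_value: str) -> bool:
    """Lenient check: accept any text/* type, not just text/plain."""
    mime, _charset = parse_content_type(header_value)
    return mime.startswith("text/")

print(parse_content_type("text/html; charset=UTF-8"))  # ('text/html', 'utf-8')
print(looks_like_text("text/html; charset=UTF-8"))     # True
print(looks_like_text("video/x-ms-asf"))               # False
```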
Besides text/html and text/plain, some of the more common MIME types I got back were application/octet-stream, application/x-httpd-php, text/x-perl (mostly error pages), video/x-ms-asf, application/x-httpd-cgi, image/gif, and image/jpeg.
Even among files ostensibly marked as text, there were a wide variety of questionable MIME types:
No, Really, Robots Dot TEXT
An error similar to using the wrong content type is uploading a robots.txt file in a format other than plain text. Popular mistakes here include Word documents, RTF documents, and HTML. I even found LaTeX and KOffice documents.
One piece of server software (called Cougar, which, as near as I can tell, is either Microsoft Small Business Server or IIS) even spits out ASF streaming video files when asked for a robots.txt file. Fun.
Character encodings specify which bytes correspond to which letters and other characters. Sites specify what character set a response is in within the Content-type header. Some sites serve up robots.txt files in little-used encodings, such as UTF-16. UTF-16 is tricky for a number of reasons, not the least of which is the different endian encodings. Of the 463 UTF-16 files I found, approximately 10% were not valid UTF-16, even though they included a UTF-16 BOM.
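A check along these lines can be sketched as follows. It assumes “valid” simply means the bytes decode cleanly as UTF-16 in the byte order the BOM announces; the function name and return labels are hypothetical, and this is not necessarily the test my analyzer applied.

```python
import codecs

def check_utf16(body: bytes) -> str:
    """Classify a response body as 'no BOM', 'valid', or 'invalid' UTF-16."""
    if not body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "no BOM"
    try:
        # The generic "utf-16" codec reads the BOM to pick the byte order.
        body.decode("utf-16")
    except UnicodeDecodeError:
        return "invalid"
    return "valid"

good = "User-agent: *\nDisallow:\n".encode("utf-16")  # encoder prepends a BOM
print(check_utf16(good))                # valid
print(check_utf16(good + b"\x00"))      # invalid (odd number of bytes)
print(check_utf16(b"User-agent: *\n"))  # no BOM
```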
Otherwise, I saw close to 300 unique character sets claimed by servers, even discarding obviously incorrect ones and making them all lower case. These included some ones I hadn’t seen before, like “nf_z_62-010”, “ibm-939”, and “fi_fi.iso-8859-15@euro”.
robots.txt files have one and only one proper way to comment, which is to put comments after a hash mark (#). However, I found HTML comments (<!-- -->), C++ style comments (//), and a variety of others, including simple inline comments.
Some people seem rather befuddled as to what constitutes a robots.txt file. For example, the most common confusion I’ve found is people using the raw text dump of the Web Robots Database as their robots.txt file. I’m not just talking about a couple of sites, either. Approximately 1 in every 1,000 websites I looked at does this. It’s really quite bizarre. This seems to be part of a more general mistake wherein people copy instructions on how to set up a robots.txt file into the contents of robots.txt files. For example, here are a few: www.cooljobscanada.com, www.numis.co.uk, www.volubilis2000.com, www.kickapoo-orchard.com, www.aplussupply.com.
A list of videogames. Several .htaccess files. Access logs. Lists of keywords and website descriptions, including an actual keyword stuffing example. Bash scripts, PHP pages, and everything in between.
There’s even a description of a swimming pool. In German.
Apparently there’s another protocol, similar to robots.txt, for advertising the contact information for a site. A file called info.txt is supposed to be placed in the root of the site, which sites like Alexa will look for when trying to find out who owns the domain. I found a lot of these records in the robots.txt files.
Someday I’ll have to see how many of these there are in the wild.
There are no wildcards (also known as pattern matching) in the official robots.txt specs, but various search engines have added extensions to support this.
For example, Google, MSN Search, and Yahoo! allow an asterisk (*) to match any sequence of characters, and a dollar sign ($) to match the end of the URL. So, to block spiders from downloading any JPEG image files, one might use:
User-agent: *
Disallow: /*.jpg$
Indeed, blocking spidering of certain file types is the most popular use for wildcards. Most people who are using wildcards for anything else are doing so entirely unnecessarily. For example, a lot of sites have the following rule:
The use of the non-standard wildcard above is useless, as this rule is equivalent to:
This is because rules are by default partial paths, and will match any path beginning with that string. It’s also worth noting that of all the sites which have the above rule with the wildcard, none of them have the rule without the wildcard. So, a spider which didn’t support pattern matching would be free to download URLs that start with “/RealEstateTips/”, so long as they didn’t have an asterisk after the second slash.
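To see why the trailing wildcard adds nothing, here is a minimal sketch of wildcard matching that translates a rule into a regular expression. It assumes only the two extensions described above (* and a trailing $); the function name is my own, and the “/RealEstateTips/*” rule is my reading of the rule discussed in the text.

```python
import re

def wildcard_match(rule_path: str, url_path: str) -> bool:
    """Match a URL path against a rule that may contain '*' and a final '$'."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        pattern += r"\Z"  # '$' pins the rule to the end of the URL
    # re.match anchors at the start only, so unanchored rules stay prefixes.
    return re.match(pattern, url_path) is not None

print(wildcard_match("/*.jpg$", "/images/cat.jpg"))          # True
print(wildcard_match("/*.jpg$", "/images/cat.jpg?size=2"))   # False

# With and without the trailing wildcard, the same URL is blocked.
url = "/RealEstateTips/buying.html"
print(wildcard_match("/RealEstateTips/*", url))  # True
print(wildcard_match("/RealEstateTips/", url))   # True

# A spider without wildcard support treats '*' as a literal character.
print(url.startswith("/RealEstateTips/*"))       # False
```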
Common Syntax Errors
So, besides the above, what are some of the common errors? The spec says that records are separated by blank lines, and the most common errors center around that. The most common is putting a blank line between a User-agent line and the rules that should apply to it, with 74,043 files doing this. Next up is the placement of a Disallow or Allow rule with no User-agent or Disallow/Allow rule immediately before it, with 64,921 files making this mistake. The next is placing a User-agent line immediately after a Disallow/Allow line, with no blank line in between; 32,656 files did this. Finally, lines which were neither comments, nor blank, nor rules showed up in 22,269 files.
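These four checks can be sketched as a tiny linter. This is a simplified reconstruction for illustration, not the analyzer I actually ran: it skips comment-only lines, and it treats any unrecognized “Name: value” line as an extension and not as junk.

```python
import re
from typing import List, Tuple

FIELD = re.compile(r"^([A-Za-z-]+)\s*:")

def classify(line: str) -> str:
    """Label a line as blank, comment, agent, rule, other, or junk."""
    text = line.split("#", 1)[0].strip()
    if not text:
        return "comment" if "#" in line else "blank"
    match = FIELD.match(text)
    if not match:
        return "junk"
    name = match.group(1).lower()
    if name == "user-agent":
        return "agent"
    if name in ("allow", "disallow"):
        return "rule"
    return "other"  # assumed to be an extension such as Crawl-delay

def lint(robots_txt: str) -> List[Tuple[int, str]]:
    """Return (line number, problem) pairs for the four common errors."""
    problems = []
    prev = "blank"          # kind of the previous non-comment line
    before_blank = "blank"  # kind of the line before the current blank run
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        kind = classify(line)
        if kind == "comment":
            continue
        if kind == "blank":
            if prev != "blank":
                before_blank = prev
        elif kind == "rule" and prev == "blank":
            if before_blank == "agent":
                problems.append((number, "blank line between User-agent and its rules"))
            else:
                problems.append((number, "rule with no User-agent or rule before it"))
        elif kind == "agent" and prev == "rule":
            problems.append((number, "User-agent right after a rule, no blank line"))
        elif kind == "junk":
            problems.append((number, "not a comment, blank line, or rule"))
        prev = kind
    return problems

sample = "\n".join([
    "User-agent: *",
    "",
    "Disallow: /tmp/",
    "User-agent: OtherBot",
    "Disallow: /private/",
    "this is not a rule",
])
for number, problem in lint(sample):
    print(number, problem)
```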
The IETF robots.txt draft spec includes a provision for extensions to the robots.txt format. Basically, along with “Allow” and “Disallow” lines, spiders can optionally support extensions for enhanced control over the robot’s behavior. The most widely-deployed of these is the Crawl-delay extension.
MSN Search, Yahoo!, and Ask all support Crawl-delay, which is used to insert a delay between successive accesses of a web server. A typical Crawl-delay might look something like this:
User-agent: *
Crawl-delay: 5
Which spiders that support Crawl-delay would interpret as meaning they should wait 5 seconds between requests to the site. I found tens of thousands of these entries.
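As a sketch of how a spider might honor this, Python's standard urllib.robotparser can read the value (its crawl_delay method is available in Python 3.6 and later). The fetch_politely helper, the fetch callback, and the injectable sleep parameter are all made up for the example.

```python
import time
from typing import Callable, Iterable, List, Optional
from urllib.robotparser import RobotFileParser

def crawl_delay_for(robots_lines: List[str], agent: str) -> Optional[int]:
    """Return the Crawl-delay that applies to `agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.crawl_delay(agent)

def fetch_politely(urls: Iterable[str], fetch: Callable[[str], str],
                   delay: Optional[int],
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0 and delay:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url))
    return results

delay = crawl_delay_for(["User-agent: *", "Crawl-delay: 5"], "SuperHappyRobot")
print(delay)  # 5
```

Passing sleep as a parameter is only there so the pause can be swapped out in tests; a real spider would more likely track the last request time per host.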
I found a LOT of typos in these files. You wouldn’t think it would be very hard to spell the limited vocabulary of “User-agent” and “Disallow” correctly, but you’d be wrong. For example, I found 69 typos of Disallow. 69! That’s not even counting the ones I found with weird characters in the middle of the word.
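A lenient parser could try to map misspelled field names back to the ones it knows. Here is a minimal sketch using difflib from the standard library; the list of known fields and the 0.8 similarity cutoff are arbitrary choices for the example, and a cutoff that loose could also produce false matches.

```python
import difflib
from typing import Optional

KNOWN_FIELDS = ["user-agent", "disallow", "allow", "crawl-delay"]

def normalize_field(name: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the known field closest to `name`, or None if nothing is close."""
    cleaned = name.strip().lower()
    matches = difflib.get_close_matches(cleaned, KNOWN_FIELDS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(normalize_field("Dissalow"))   # disallow
print(normalize_field("Useragent"))  # user-agent
print(normalize_field("Host"))       # None
```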
Fingerprinting Using robots.txt
Sometimes, we can use robots.txt file contents for fingerprinting the sites that serve them up. For example, we can fingerprint the sites designed by Moriah.com by looking for robots.txt files with the contents:
this file placed here so you don't fill up my error log looking for it :-)
Similarly, we can find the more than 7,000 real estate sites designed by Advanced Access by looking for the rule:
More usefully, we can identify one Korean domain squatter by looking for robots.txt files that contain only a meta tag like:
meta http-equiv=refresh content='0;url=http://www.hiplayer.com'
(brackets excluded because of a bug in WordPress).
At the time I spidered, we could identify another domain squatter by looking for a robots.txt file like:
User-agent: *
Disallow: /pixel/
Disallow: /library/
Disallow: /results_monitor.asp
They’ve since switched to a more generic, but still easily-identifiable robots.txt file.
Using similar methods, it’s easy to find a lot more domain squatters, mass-hosted websites, etc. A search engine could potentially maintain a list of such signatures and, based solely on the robots.txt file, not bother indexing the page. Or, more generally, it could increase or decrease the relevance and ranking of the site in its search results.
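A signature list like that could be as simple as a table of hashes of normalized robots.txt bodies. The sketch below assumes exact matching after trimming whitespace, dropping blank lines, and lowercasing; the label strings and function names are hypothetical.

```python
import hashlib
from typing import Dict, Optional

def fingerprint(robots_txt: str) -> str:
    """Hash the file after trimming lines, dropping blanks, and lowercasing."""
    lines = [line.strip().lower() for line in robots_txt.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

# Known signatures mapped to a label (the label text is made up).
SIGNATURES: Dict[str, str] = {
    fingerprint(
        "User-agent: *\n"
        "Disallow: /pixel/\n"
        "Disallow: /library/\n"
        "Disallow: /results_monitor.asp\n"
    ): "domain squatter",
}

def identify(robots_txt: str) -> Optional[str]:
    """Return the label for a known signature, or None if unrecognized."""
    return SIGNATURES.get(fingerprint(robots_txt))

seen = "user-agent: *\r\n\r\nDisallow: /pixel/\r\nDisallow: /library/\r\nDisallow: /results_monitor.asp\r\n"
print(identify(seen))                           # domain squatter
print(identify("User-agent: *\nDisallow:\n"))   # None
```

Exact hashing is brittle if the operator varies the file slightly; matching on a distinctive rule or substring would be a looser alternative.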
Okay, so what conclusions can we draw from this mess of data? The primary conclusion, I think, is that the Robots Exclusion Protocol is more complicated than it actually seems. As a spider, in order to properly parse the variety of robots.txt files you’ll find in the wild you’ll need to write an extremely lenient parser (following the Robustness Principle), mostly ignore content types, handle a variety of character encodings (and in many cases ignore those returned by the server), detect HTML and other content returned in the guise of robots.txt files, and potentially implement multiple extensions to the accepted standard.
How about the position, discussed above, that spiders shouldn’t spider or download content without the explicit permission of the webmaster? Belgium has certainly come down on the side of requiring explicit permission. However, the evidence shows that Google is in the right on this one:
“Given the vast size of the Internet, it is impossible for a search engine to contact personally each owner of a web page to determine whether the owner desires its web page to be searched, indexed or cached… If such advanced permission was required, the internet would promptly grind to a halt,” Google’s senior counsel and head of public policy Andrew McLaughlin told the Senate Legal and Constitutional Affairs Committee.
As seen in the status codes section, if this were to happen, nearly three quarters of domains on the web would go “dark” for search engines. If these sites went dark for search engines, they would essentially be offline for the majority of web users. Such an action would be in nobody’s best interest: not the site owner’s, and certainly not that of the web-using public at large.
On a less serious note, it’s always interesting to see just how vast the Internet really is. Few things drive that home for me as much as seeing how varied the content people generate on the web can be.
So, until next time, I leave you with a quote from one of the robots.txt files I came across:
are you searching something??? 🙂
Yes. Yes I am. And so far, every time I look, I find it.