Archive for March, 2007

Microsoft Web

Saturday, March 31st, 2007

Here’s a snippet from Microsoft’s current corporate home page:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en" dir="ltr">
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="SearchTitle" content="Microsoft.com">
<meta name="SearchDescription" content="Microsoft.com Homepage">

Download this code: /code/microsoft.txt

Can you spot the problem?

Update: James Booker is the first in with an answer: there are two Content-Type meta tags above, and they specify different character sets.

Specifying the content type in meta tags is a bit of a hack: the browser has to scan the first section of the document for a content-type declaration, then reinterpret the page using the character set it finds there. Specifying a character set of "utf-16" makes no sense in this scenario, because the browser does its sniffing by interpreting the HTML as ASCII. If the page were actually UTF-16, this wouldn't work, as the representation of the string "Content-type" in UTF-16 isn't identical to its representation in ASCII or UTF-8, as we can see in a Python shell:
>>> "Content-type".encode("utf-16")
'\xfe\xff\x00C\x00o\x00n\x00t\x00e\x00n\x00t\x00-\x00t\x00y\x00p\x00e'
>>> "Content-type".encode("utf-8")
'Content-type'
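
To make the sniffing failure concrete, here is a rough Python 3 sketch of an ASCII-based scan for a charset declaration. The helper name sniff_meta_charset, the 1024-byte limit, and the regular expression are assumptions made for illustration, not what any browser actually runs.

```python
import re

# Hypothetical pattern: look for "charset=<name>" in raw bytes, assuming
# ASCII-valued bytes stand for ASCII characters.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_meta_charset(raw, limit=1024):
    """Return the first declared charset in the leading bytes, or None."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None

tag = '<meta http-equiv="Content-Type" content="text/html; charset=utf-16">'

# Encoded as UTF-8, the declaration is readable as ASCII and is found.
print(sniff_meta_charset(tag.encode("utf-8")))

# Encoded as UTF-16, every character is padded with a NUL byte, so the
# byte sequence "charset" never appears and the scan comes up empty.
print(sniff_meta_charset(tag.encode("utf-16")))
```

In other words, a page that really was UTF-16 could never announce that fact through a meta tag found this way.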

Thankfully, the HTML spec foresaw this problem:

The META declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element.

So there are actually three problems with the above HTML: there are two content-type declarations in meta tags, one of them is bogus, and the correct one isn't as early as possible in the head element. These problems are mitigated, thankfully, by the presence of an HTTP header that specifies the correct character set, and by the incredible amount of effort browser vendors have put into making their code tolerant of mistakes such as these.
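
As a minimal sketch of that mitigation, the following Python 3 code lists the meta charset declarations in a page and lets a charset from the HTTP header take precedence over them, which is the priority order HTML 4 describes. The function names are hypothetical, the header value "utf-8" is an assumed example, and falling back to the first meta declaration is a simplifying assumption rather than documented browser behaviour.

```python
import re

# Hypothetical pattern: a charset value inside a single <meta ...> tag.
META_CHARSET = re.compile(r'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)

def declared_charsets(html):
    """Return every charset named in a meta tag, in document order."""
    return [name.lower() for name in META_CHARSET.findall(html)]

def effective_charset(http_charset, html):
    """Prefer the HTTP header; otherwise assume the first meta tag wins."""
    if http_charset:
        return http_charset.lower()
    found = declared_charsets(html)
    return found[0] if found else None

head = '''<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-16">
<title>Microsoft Corporation</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="SearchTitle" content="Microsoft.com">
'''

found = declared_charsets(head)
print(found)                              # the two conflicting declarations
print(len(set(found)) > 1)                # True when they disagree
print(effective_charset("utf-8", head))   # header present: it decides
print(effective_charset(None, head))      # no header: bogus value comes first
```

Without the header, a naive reader of this page would land on the bogus declaration first, which is why its ordering matters.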

robots.txt Adventure

Monday, March 12th, 2007

Introduction.txt

Last October I got bored and set my spider loose on the robots.txt files of the world. Having had a good deal of positive feedback on my HTTP Headers survey, I had decided to poke around in robots.txt files and see what sorts of interesting things I could find.

Since then, I’ve taken six weeks of vacation and gotten very busy at work, so I’m just now getting around to analyzing all the data I gathered. These are some of the results of that analysis. (more…)