Improving our handling of robots.txt
As of this evening (3 December 2010), MLBot is using a new policy when processing robots.txt files. This policy only affects robots.txt files that use the "Allow" directive. For example, Google's Webmaster site about robots.txt used the following as a sample of how to use Allow:
User-agent: *
Disallow: /norobots/
Allow: /norobots/index.html
In the past, we attempted to process such robots.txt files so that they would be interpreted as the author of this example intended. We found, however, that in doing so we sometimes ended up crawling things that other authors did not intend us to crawl. For example, we ran across this:
User-agent: *
Disallow: /norobots/
Allow: /
As a result, we ended up crawling something in the /norobots/ directory. It's obvious to a person what the author's intent is in this particular case, but in the general case there are ambiguities that are difficult or impossible to resolve. The problem gets even more difficult when wildcards enter the picture. We could try to argue that this syntax is erroneous or at least ambiguous, but even if we're "right," we would risk alienating the site owner and gaining a reputation as a bot that does not respect robots.txt. That is the last thing we want.
According to the Wikipedia article about the Robots exclusion standard, the standard implementation is that the first matching robots.txt pattern always wins. Google does something different, but we don't know how they implement it, and they are apparently the only bot that handles Allow that way.
This change will never make MLBot assume more permissions than previous versions did. In most cases you won't see any change in behavior. If you do see a change, it will be that MLBot refuses to crawl something that it previously assumed it was allowed to crawl. In the two cases above, the new behavior will cause MLBot to refuse to crawl any URL in the /norobots/ directory, because the "first matching pattern" rule will see the Disallow for that directory and stop checking. If you want MLBot to crawl /norobots/index.html, then you must write:
User-agent: *
Allow: /norobots/index.html
Disallow: /norobots/
Or, if you want to restrict MLBot specifically:
User-agent: MLBot
Allow: /norobots/index.html
Disallow: /norobots/
The "first matching pattern" rule is completely unambiguous and follows the standard that other bots follow. By following it, we vastly reduce the possibility that MLBot will crawl something that you don't want crawled.
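To make the rule concrete, here is a minimal Python sketch of "first matching pattern wins" applied to the examples above. It is only an illustration, not MLBot's actual code: the names parse_rules and is_allowed are hypothetical, patterns are treated as plain path prefixes (no wildcards), and each rule group is assumed to have a single User-agent line.

```python
def parse_rules(robots_txt, agent="*"):
    """Collect (allowed, prefix) rules, in file order, for one User-agent group.

    Simplifying assumptions for this sketch: one User-agent line per group,
    and no wildcard support in paths.
    """
    rules = []
    active = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            active = value.lower() == agent.lower()
        elif active and field in ("allow", "disallow") and value:
            rules.append((field == "allow", value))
    return rules


def is_allowed(rules, path):
    """Return the verdict of the first rule whose prefix matches the path."""
    for allowed, prefix in rules:
        if path.startswith(prefix):
            return allowed  # first match wins; later rules are never checked
    return True  # no rule matched, so crawling is permitted


GOOGLE_SAMPLE = """User-agent: *
Disallow: /norobots/
Allow: /norobots/index.html
"""

AMBIGUOUS = """User-agent: *
Disallow: /norobots/
Allow: /
"""

REORDERED = """User-agent: *
Allow: /norobots/index.html
Disallow: /norobots/
"""

if __name__ == "__main__":
    for name, text in [("google sample", GOOGLE_SAMPLE),
                       ("ambiguous", AMBIGUOUS),
                       ("reordered", REORDERED)]:
        verdict = is_allowed(parse_rules(text), "/norobots/index.html")
        print(name, "->", "crawl" if verdict else "skip")
```

Under these assumptions, only the reordered file lets /norobots/index.html through. In the other two files the Disallow line matches first, so the later Allow line is never consulted.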