Why focused crawling?

When I say that we've built a high performance focused crawler, I am often asked, "What's a focused crawler?" Most people understand the "crawler" part and even have a vague understanding of how a Web crawler works. But a "focused" crawler is a foreign concept to most people, although a short explanation makes it obvious why every crawler should be focused to some extent.

So, again, why focused crawling? Because the Web is big. Really big. WorldWideWebSize.com uses statistical sampling of Google, Bing, and other search engines to estimate the size of the indexed Web. The site says that they purposely underestimate the size, and it the estimates vary quite widely. Today's (2010-12-28) estimate is "at least 13.97 billion pages," but in the past the estimates have been as high as 65 billion. 20 billion is a good round number, and well within the historical estimation range."

20 billion is a big number. It's a lot of pages.

Not only is the Web big, but most of it is irrelevant to what you're searching for. If you're looking for information about steam trains, then information about rabbits, airplanes, rock music, computer programming, or millions of other topics is either irrelevant or at best marginally relevant to your search. Simply put, a focused crawler attempts to limit the amount of data you have to look through in order to find relevant information.

In general, we can lump the use of crawlers into three broad categories:

  • Limited crawlers that examine a small handful of sites looking for very specific documents. These crawlers are typically given a list of URLs to examine in depth, and are prevented from exploring outside a very narrow range. This type of crawler is more of a scraper in that it's not really "looking for" anything, but rather gathering information in known locations. These crawlers can be invaluable business or research tools, and projects like Scrapy make building and operating them much easier.
  • Breadth-first crawlers that attempt to index the entire Web. This is what most people imagine when they think of Web crawlers. They imagine a huge bank of computers connected to the Internet through a big pipe, sucking down Web pages like mad and indexing them for quick lookup. Crawlers like Googlebot, MSNBot, Yahoo Slurp, etc. fit into this category. These crawlers are incredibly difficult to build and require a vast infrastructure of cooperating computers, huge amounts of data storage, and gigabit-level Internet connections in order to download and index billions of Web pages. Building and operating a general crawler like this is hugely expensive.
  • Crawlers that use various techniques to limit what kind of documents they examine, or what sites they visit in order to find potentially useful documents. A crawler of this type is given a list of starting places (seed URLs), and some rules to help direct its search. This is the broad category of focused crawlers, and in fact every crawler these days uses some focused techniques in order to do its job.

The world of focused crawlers covers the entire gamut from the lowliest page scraper to the biggest and baddest Web bot out there. You most often find them, though, in the lower end of that middle ground--just above the closely targeted scraper. They're looking for particular types of information, and typically use machine learning techniques to determine where and when to examine what parts of the Web. Whereas a scraper looks at specific things and a breadth-first crawler tries to look at everything, a focused crawler tries to look only at things that are "interesting," with "interesting" being defined as documents that contain information relevant to the search, or likely to lead to such documents.

So why is that important? Let me give you an example. A Google search today for steam trains reports 794,000 results. If we assume that there are 20 billion pages indexed on the Web, then approximately one in 25,000 pages will have information at least tangentially related to steam trains.

If we were to crawl the Web at random--just grab any old URL and download the document--then, on average, we would expect to find one document about steam trains in every 25,000 documents we examine. That's assuming, of course, that we somehow limited the crawler to looking only at URLs that are indexed, and we don't download things like videos, images, and other types of things that aren't "documents." We would have to examine 25 million documents in order to get 1,000 references to steam trains, and there's no telling how relevant those documents might be. For example, we might find this page, which mentions steam trains but isn't really relevant to them.

How long it takes to look at 25 million pages is determined by the speed of the Internet connection and the average size of a Web page. In over three years of crawling the Web, we've found that the average size of an HTML page is about 40 kilobytes. So 25 million pages would be approximately one thousand million kilobytes, or about a terabyte of data. A 10 megabit Internet connection will give you about 1 megabyte of data per second, meaning that you could crawl about 2.2 million pages per day. It would take you almost 12 days to crawl 25 million pages and get 1,000 possibly relevant pages about steam trains.

The first version of our our MLBot crawler was quite naive in this way. We gave it a starting point and it began scouring the Web looking for media files. We soon learned a few things:

  • Approximately one URL out of 5,000 that we looked at contained some sort of media (a video or audio file).
  • Approximately one HTML page out of 10,000 that we looked at contained a link to a media file.
  • Pages that linked to media files tended to cluster--they linked to each other.
  • There are many small clusters and a few really big clusters of media files scattered throughout the Web.
  • A very small number of HTML pages serve as hubs; they link to multiple clusters.
  • The vast majority of the Web is totally irrelevant when it comes to locating media files.

It was at this point that we asked the question that lead us to focused crawling:

 

What if we knew where to look, and could ignore everything else?

 

The ideal, of course, would be the "genie" that could just give us a list of every media file and the pages that link to those files. Then we wouldn't need to crawl at all--just harvest the files. The genie doesn't exist, of course, but we could use the information we gathered to refine the crawler's behavior. We started by doing the following:

  • Suspected media files (those URLs ending with ".mp3", ".flv", etc.) were moved to the head of the queue so that they were crawled first.
  • Things that were obviously not or very unlikely to be media or lead to media (".doc", ".jpg", ".pdf", etc.) were never placed into the queue to be crawled.
  • Sites that were unlikely to lead to media (message forums like "forums.example.com", and other such sites) were also excluded from the queue.
  • We instituted a "futility" cutoff. The crawler kept track of the depth to which it crawled, and if it didn't find media after crawling to a particular depth, that line of crawling was abandoned.

The first two items were obvious refinements and very easy to implement. The third--blocking unproductive sites--was easy enough to implement but became unwieldy in short order. It's simply impossible to manually identify every unproductive site. The last was easily implemented, too, and gave a huge improvement in performance. It turned out that if the crawler didn't find a link to a media file after crawling five levels deep, it was very unlikely ever to find one along that path.

Those simple changes gave us an order of magnitude improvement in our crawler performance. Where we had been finding one media file out of 5,000 documents, we were now getting one media file out of 500 URLs crawled. At this point we began wondering just how close we could come to the magic genie, and we began doing some research and developing a simple machine learning system that could direct (focus) the crawler more productively.

Our first ML system consisted of a naive Bayes classifier that examined the terms in URLs to assign a probability that the URL was a media file or would lead to a media file. URLs that were assigned a higher priority were crawled before lower-priority URLs. When a media file was found, we rewarded the terms in that URL, and fractionally rewarded terms along the path that lead to that URL.

We knew that this would be an improvement over what we had, but were completely surprised at just how much of an improvement. We went from one media file in 500 URLs to one in 70, and one page in 100 was linking to a media file. This simple naive Bayes classifier that took two days to write and fit into our crawler framework gave us a 7x improvement in crawler productivity--above and beyond the 10x improvement we got with our simple measures. We found that we could eliminate most of the ad hoc measures (we continue to manually exclude things like image and PDF files, based on their extensions).

We've since refined the ML system by giving it other sources of evidence, and today we find an average of one media file in every 12 documents we look at. And, whereas it's true that media clusters on the Web, we're continually finding new sites and new clusters that contain media, meaning that the crawler is not over-focused; it continues to find new sources.

We started with a naive crawler that found one media file in every 5,000 URLs it looked at, and today our average is about one in 12. That's more than 400 times more media documents with the same amount of crawling effort.

The techniques we use are not specific to media. Any kind of crawling is amenable to the techniques we used to focus the crawl. If we got the same type of improvement from the "steam trains" example, we could go from one relevant document in 25,000 to one in 63. But it turns out that the improvements are better than linear. The machine learning system finds productive paths relatively quickly and, more importantly, eliminates unproductive paths even faster. The result is that the crawler can zero in on relevant clusters and within a few hours begin finding one relevant document out of every 20 that it looks at. Rather than taking 12 days to find 1,000 documents of questionable quality, the crawler can find 20,000 or more highly relevant documents in a single day.

And that's why focused crawling. By applying machine learning techniques and the knowledge we've learned over years of large-scale crawling, we can more efficiently crawl the Web looking for specific information and more effectively produce timely, relevant results using fewer resources.

Developers, a fire hose of web video is now available.

Today we're making our fire hose of web video available to all developers.  If you're familiar with Twitter's fire hose API you'll understand ours.  17 videos a second representing over 40 hours of video every minute come through the fire hose. These are the same videos that appear on our news service, debsnews.

More details are here: http://api.metadatalabs.com

You'll find a couple of working sample apps, VideoRoulette and VideoDeck, which will give you a feel for what's available.

The API is free to use.  We hope you will find new and interesting uses for it.

Improving our handling of robots.txt

As of this evening (3 December 2010), MLBot is using a new policy when processing robots.txt files.  This policy only affects robots.txt files that use the "Allow" directive.  For example, Google's Webmaster site about robots.txt used this as a sample of how to use Allow.

User-agent: *

Disallow: /norobots/

Allow: /norobots/index.html

In the past, we attempted to process such robots.txt files so that they would be interpreted as the author of this example intended.  We found, however, that in doing so we sometimes ended up crawling things that other authors did not intend us to crawl.  For example, we ran across this:

User-agent: *

Disallow: /norobots/

Allow: /

And ended up crawling something in the /norobots/ directory.  It's obvious to a person what the author's intent is in this particular case, but in the general case there are ambiguities that are difficult or impossible to resolve.  The problem gets even more difficult when wildcards enter the picture.  We could try to argue that this syntax is erroneous or at least ambiguous, but even if we're "right," we would risk alienating the owner and getting the reputation as a bot that does not respect robots.txt.  That is the last thing we want.

According to the Wikipedia article about the Robots exclusion standard, the standard implementation is that the first matching robots.txt pattern always wins.  Google does something different, but we don't know how they implement it and they're apparently the only bot that does Allow in that way.

So as of today we have changed our robots.txt handling to follow the "first matching pattern" rule.

This change will not in any case make MLBot assume more permissions than previous versions.  In most cases you won't see any change in behavior.  If you do see a change in behavior it will be that MLBot will refuse to crawl something that it had in the past assumed that it was allowed to crawl.  In the two cases above, the new behavior will cause MLBot to refuse to crawl any URL in the /norobots/ directory because the "first matching pattern" rule will see the Disallow for that directory and stop checking.  If you want MLBot to crawl /norobots/index.html, then you must write:

User-agent: *

Allow: /norobots/index.html

Disallow: /norobots/

Or, if you want to restrict MLBot specifically:

User-agent: MLBot

Allow: /norobots/index.html

Disallow: /norobots/

The "first matching pattern" rule is completely unambiguous and follows the standard that other bots follow.  By following it, we vastly reduce the possibility that MLBot will crawl something that you don't want crawled.

 

 

 

YouTube is the Model T of Internet Video

A hundred years ago Henry Ford introduced the Model T and took us from early-adopter horseless carriages into the modern mass-market automobile era.

Web video feels a lot like that right now and today's equivalent for the Model T would have to be YouTube.  We are constantly impressed by how much it's growing.

Here are some stats from our web crawler:

600 million media urls (and adding more than a million every day)

3.3 billion web pages link or embed media

41 million content authors have created media

600 times a second we find a link or embed to a known media url

13 times a second we find a new media url

1 time a second we find a new content author

This all looks pretty incredible but it's really just getting started.  When HTML5 video comes along and every mobile phone you can buy includes a video camera it will make today's video growth seem quaint by comparison.