Why focused crawling?
When I say that we've built a high performance focused crawler, I am often asked, "What's a focused crawler?" Most people understand the "crawler" part and even have a vague understanding of how a Web crawler works. But a "focused" crawler is a foreign concept to most people, although a short explanation makes it obvious why every crawler should be focused to some extent.
So, again, why focused crawling? Because the Web is big. Really big. WorldWideWebSize.com uses statistical sampling of Google, Bing, and other search engines to estimate the size of the indexed Web. The site says that they purposely underestimate the size, and it the estimates vary quite widely. Today's (2010-12-28) estimate is "at least 13.97 billion pages," but in the past the estimates have been as high as 65 billion. 20 billion is a good round number, and well within the historical estimation range."
20 billion is a big number. It's a lot of pages.
Not only is the Web big, but most of it is irrelevant to what you're searching for. If you're looking for information about steam trains, then information about rabbits, airplanes, rock music, computer programming, or millions of other topics is either irrelevant or at best marginally relevant to your search. Simply put, a focused crawler attempts to limit the amount of data you have to look through in order to find relevant information.
In general, we can lump the use of crawlers into three broad categories:
- Limited crawlers that examine a small handful of sites looking for very specific documents. These crawlers are typically given a list of URLs to examine in depth, and are prevented from exploring outside a very narrow range. This type of crawler is more of a scraper in that it's not really "looking for" anything, but rather gathering information in known locations. These crawlers can be invaluable business or research tools, and projects like Scrapy make building and operating them much easier.
- Breadth-first crawlers that attempt to index the entire Web. This is what most people imagine when they think of Web crawlers. They imagine a huge bank of computers connected to the Internet through a big pipe, sucking down Web pages like mad and indexing them for quick lookup. Crawlers like Googlebot, MSNBot, Yahoo Slurp, etc. fit into this category. These crawlers are incredibly difficult to build and require a vast infrastructure of cooperating computers, huge amounts of data storage, and gigabit-level Internet connections in order to download and index billions of Web pages. Building and operating a general crawler like this is hugely expensive.
- Crawlers that use various techniques to limit what kind of documents they examine, or what sites they visit in order to find potentially useful documents. A crawler of this type is given a list of starting places (seed URLs), and some rules to help direct its search. This is the broad category of focused crawlers, and in fact every crawler these days uses some focused techniques in order to do its job.
The world of focused crawlers covers the entire gamut from the lowliest page scraper to the biggest and baddest Web bot out there. You most often find them, though, in the lower end of that middle ground--just above the closely targeted scraper. They're looking for particular types of information, and typically use machine learning techniques to determine where and when to examine what parts of the Web. Whereas a scraper looks at specific things and a breadth-first crawler tries to look at everything, a focused crawler tries to look only at things that are "interesting," with "interesting" being defined as documents that contain information relevant to the search, or likely to lead to such documents.
So why is that important? Let me give you an example. A Google search today for steam trains reports 794,000 results. If we assume that there are 20 billion pages indexed on the Web, then approximately one in 25,000 pages will have information at least tangentially related to steam trains.
If we were to crawl the Web at random--just grab any old URL and download the document--then, on average, we would expect to find one document about steam trains in every 25,000 documents we examine. That's assuming, of course, that we somehow limited the crawler to looking only at URLs that are indexed, and we don't download things like videos, images, and other types of things that aren't "documents." We would have to examine 25 million documents in order to get 1,000 references to steam trains, and there's no telling how relevant those documents might be. For example, we might find this page, which mentions steam trains but isn't really relevant to them.
How long it takes to look at 25 million pages is determined by the speed of the Internet connection and the average size of a Web page. In over three years of crawling the Web, we've found that the average size of an HTML page is about 40 kilobytes. So 25 million pages would be approximately one thousand million kilobytes, or about a terabyte of data. A 10 megabit Internet connection will give you about 1 megabyte of data per second, meaning that you could crawl about 2.2 million pages per day. It would take you almost 12 days to crawl 25 million pages and get 1,000 possibly relevant pages about steam trains.
The first version of our our MLBot crawler was quite naive in this way. We gave it a starting point and it began scouring the Web looking for media files. We soon learned a few things:
- Approximately one URL out of 5,000 that we looked at contained some sort of media (a video or audio file).
- Approximately one HTML page out of 10,000 that we looked at contained a link to a media file.
- Pages that linked to media files tended to cluster--they linked to each other.
- There are many small clusters and a few really big clusters of media files scattered throughout the Web.
- A very small number of HTML pages serve as hubs; they link to multiple clusters.
- The vast majority of the Web is totally irrelevant when it comes to locating media files.
It was at this point that we asked the question that lead us to focused crawling:
What if we knew where to look, and could ignore everything else?
The ideal, of course, would be the "genie" that could just give us a list of every media file and the pages that link to those files. Then we wouldn't need to crawl at all--just harvest the files. The genie doesn't exist, of course, but we could use the information we gathered to refine the crawler's behavior. We started by doing the following:
- Suspected media files (those URLs ending with ".mp3", ".flv", etc.) were moved to the head of the queue so that they were crawled first.
- Things that were obviously not or very unlikely to be media or lead to media (".doc", ".jpg", ".pdf", etc.) were never placed into the queue to be crawled.
- Sites that were unlikely to lead to media (message forums like "forums.example.com", and other such sites) were also excluded from the queue.
- We instituted a "futility" cutoff. The crawler kept track of the depth to which it crawled, and if it didn't find media after crawling to a particular depth, that line of crawling was abandoned.
The first two items were obvious refinements and very easy to implement. The third--blocking unproductive sites--was easy enough to implement but became unwieldy in short order. It's simply impossible to manually identify every unproductive site. The last was easily implemented, too, and gave a huge improvement in performance. It turned out that if the crawler didn't find a link to a media file after crawling five levels deep, it was very unlikely ever to find one along that path.
Those simple changes gave us an order of magnitude improvement in our crawler performance. Where we had been finding one media file out of 5,000 documents, we were now getting one media file out of 500 URLs crawled. At this point we began wondering just how close we could come to the magic genie, and we began doing some research and developing a simple machine learning system that could direct (focus) the crawler more productively.
Our first ML system consisted of a naive Bayes classifier that examined the terms in URLs to assign a probability that the URL was a media file or would lead to a media file. URLs that were assigned a higher priority were crawled before lower-priority URLs. When a media file was found, we rewarded the terms in that URL, and fractionally rewarded terms along the path that lead to that URL.
We knew that this would be an improvement over what we had, but were completely surprised at just how much of an improvement. We went from one media file in 500 URLs to one in 70, and one page in 100 was linking to a media file. This simple naive Bayes classifier that took two days to write and fit into our crawler framework gave us a 7x improvement in crawler productivity--above and beyond the 10x improvement we got with our simple measures. We found that we could eliminate most of the ad hoc measures (we continue to manually exclude things like image and PDF files, based on their extensions).
We've since refined the ML system by giving it other sources of evidence, and today we find an average of one media file in every 12 documents we look at. And, whereas it's true that media clusters on the Web, we're continually finding new sites and new clusters that contain media, meaning that the crawler is not over-focused; it continues to find new sources.
We started with a naive crawler that found one media file in every 5,000 URLs it looked at, and today our average is about one in 12. That's more than 400 times more media documents with the same amount of crawling effort.
The techniques we use are not specific to media. Any kind of crawling is amenable to the techniques we used to focus the crawl. If we got the same type of improvement from the "steam trains" example, we could go from one relevant document in 25,000 to one in 63. But it turns out that the improvements are better than linear. The machine learning system finds productive paths relatively quickly and, more importantly, eliminates unproductive paths even faster. The result is that the crawler can zero in on relevant clusters and within a few hours begin finding one relevant document out of every 20 that it looks at. Rather than taking 12 days to find 1,000 documents of questionable quality, the crawler can find 20,000 or more highly relevant documents in a single day.
And that's why focused crawling. By applying machine learning techniques and the knowledge we've learned over years of large-scale crawling, we can more efficiently crawl the Web looking for specific information and more effectively produce timely, relevant results using fewer resources.