Here's my attempt at providing an introduction to crawling, in the hope that it'll clarify some of my own thoughts on the topic.

The most basic tasks of an HTTP crawler are:

  1. Download urls via HTTP.
  2. Extract urls from downloaded pages to be added to the download queue.

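Just to make those two tasks concrete, here's a minimal and deliberately naive sketch in Python. The seed url, the regex-based link extraction and the page cap are placeholder assumptions, not recommendations:

    import re
    import urllib.parse
    import urllib.request
    from collections import deque

    # Hypothetical seed url and regex-based link extraction, purely for illustration.
    SEEDS = ["http://example.com/"]
    HREF_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

    def crawl(seeds, max_pages=100):
        queue = deque(seeds)   # the download queue
        seen = set(seeds)      # urls already queued or downloaded
        pages = 0
        while queue and pages < max_pages:
            url = queue.popleft()
            try:
                # Task 1: download the url.
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            pages += 1
            # Task 2: extract urls and add new ones to the queue.
            for href in HREF_RE.findall(html):
                absolute = urllib.parse.urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)

    if __name__ == "__main__":
        crawl(SEEDS)
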
That's actually very simple, and crawlers are simple applications at heart. Two factors complicate crawler development:

  • they deal with large datasets, and require care when designing the appropriate data structures and algorithms for acceptable performance.
  • they require either multi-threaded or non-blocking IO implementations, and both are non-trivial to do well.

To be a bit more intelligent/useful, crawlers usually also provide some way of

  1. restricting the scope of the crawl
  2. prioritizing the download queue
  3. storing the downloaded pages for easy retrieval, preferably in compressed form
  4. logging and status reporting
  5. normalizing the urls
  6. complying with relevant standards and conventions, e.g. robots.txt
  7. being polite to servers
  8. recrawling urls

Let's go through the interesting ones...

  • restricting the scope of the crawl: generally accomplished via two methods – duplicate url detection (removing urls that have already been downloaded or are already in the queue) and user-defined crawl scopes. The first is performed automatically by the crawler, the second is specified by the user. Ways in which the user can limit the crawl scope include: restricting urls to the same hosts as the seed urls, limiting the link depth from the seeds, matching against regexes, etc. (There's a sketch of this, together with url normalization, after this list.)

    A note about restricting the scope of a crawl by content, for whole-web focused crawlers: content-based sites tend to cluster (i.e. link to each other), so crawling an entire cluster is easy. However, clusters not linked from the seed urls are difficult to reach, and moving from one cluster to another can be difficult too: given two clusters A and B with link hubs A1 and B1, A1 and B1 might both link to some common site, but that common site may not link back to either of them, so the crawler never finds a path from A into B.

  • prioritizing the download queue: important for both whole-web and focused crawlers, but for different reasons.

    Whole-web crawlers
    The last thing a whole-web crawler wants to do is crawl pages indiscriminately. It generally wants to fetch authoritative pages and high-degree link hubs (pages with many incoming and outgoing links) as early as possible. It also needs to spread its requests across hosts to stay polite to the servers it crawls (the fetchlist sketch after this list shows one simple way to do that).

    Focused crawlers
    Focused crawlers can be further broken down into

    • whole-web focused crawlers, which are interested in a subset of the whole web, usually partitioned by language or content. Queue prioritization is similar to the whole-web case.
    • site mirroring focused crawlers (SMFCs), which are only interested in their seed urls (and possibly all pages within the same domain/host). SMFCs can radically decrease overall crawl time by being smart about prioritizing their urls, especially when the number of hosts to crawl is small (under 100 or so). Btw, I suspect that the majority of developers who use Nutch actually belong to this category.

  • url normalization: ensures that urls which should be filtered out don't sneak through in a slightly different textual form. This includes remembering redirects encountered along the way, for example.
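
As promised, here's a hedged sketch of scope restriction and url normalization working together: a hypothetical ScopeFilter that normalizes each candidate url, then checks it against a seen set (duplicate detection), a same-host-as-seeds rule, and optional user-supplied regex and depth limits. The normalization steps shown (lower-casing the scheme and host, dropping fragments and default ports) are just common examples, not an exhaustive list.

    import re
    import urllib.parse

    def normalize(url):
        """A few typical normalization steps; a real crawler does more,
        e.g. resolving remembered redirects or sorting query parameters."""
        parts = urllib.parse.urlsplit(url)
        host = (parts.hostname or "").lower()
        # Keep non-default ports; drop :80/:443 so equivalent urls collapse.
        if parts.port and parts.port not in (80, 443):
            host = f"{host}:{parts.port}"
        # Fragments never reach the server, so discard them.
        return urllib.parse.urlunsplit(
            (parts.scheme.lower(), host, parts.path or "/", parts.query, ""))

    class ScopeFilter:
        """Decides whether a candidate url should enter the fetchlist."""
        def __init__(self, seed_urls, allow_regex=None, max_depth=None):
            self.seen = set()                      # duplicate detection
            self.seed_hosts = {urllib.parse.urlsplit(u).hostname
                               for u in seed_urls} # same-host rule
            self.allow = re.compile(allow_regex) if allow_regex else None
            self.max_depth = max_depth

        def accept(self, url, depth=0):
            url = normalize(url)
            if url in self.seen:
                return False
            if urllib.parse.urlsplit(url).hostname not in self.seed_hosts:
                return False
            if self.allow and not self.allow.search(url):
                return False
            if self.max_depth is not None and depth > self.max_depth:
                return False
            self.seen.add(url)
            return True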
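
And for queue prioritization, an equally hedged sketch of a fetchlist that keeps one queue per host and only hands out a url for a host once a politeness delay has elapsed. The 2-second delay is an arbitrary assumption, and the per-host queues are plain FIFO; a real crawler would order them by some score (link counts, depth, etc.).

    import heapq
    import time
    import urllib.parse
    from collections import defaultdict, deque

    POLITENESS_DELAY = 2.0   # seconds between requests to one host (arbitrary)

    class Fetchlist:
        """Per-host FIFO queues plus a heap of (next-allowed-time, host)."""
        def __init__(self):
            self.queues = defaultdict(deque)   # host -> urls waiting
            self.ready = []                    # heap ordered by ready time
            self.queued_hosts = set()

        def add(self, url):
            host = urllib.parse.urlsplit(url).hostname or ""
            self.queues[host].append(url)
            if host not in self.queued_hosts:
                self.queued_hosts.add(host)
                heapq.heappush(self.ready, (time.monotonic(), host))

        def next_url(self):
            """Return a url whose host may politely be hit again, else None."""
            while self.ready:
                ready_at, host = self.ready[0]
                if ready_at > time.monotonic():
                    return None                  # earliest host isn't ready yet
                heapq.heappop(self.ready)
                if self.queues[host]:
                    url = self.queues[host].popleft()
                    heapq.heappush(self.ready,
                                   (time.monotonic() + POLITENESS_DELAY, host))
                    return url
                self.queued_hosts.discard(host)  # host exhausted for now
            return None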

More advanced crawlers also

  • checkpoint at regular intervals so interrupted crawls can be recovered
  • detect and avoid bot traps
  • throttle bandwidth usage according to bandwidth availability and server response time
  • allow crawling speed to be scaled (linearly) by adding more crawl boxes

Ignoring the advanced features, the high-level modules are:

  • HTTP request and response module
  • Link extractor (can be regex or html parser)
  • download queue (we'll call this the fetchlist from now on), along with a record of which urls are currently being downloaded
  • database of downloaded urls
  • database of downloaded pages
  • link graph/database
  • url normalizer
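
To make those boundaries a bit more concrete, here's one hypothetical way the modules could be typed up. The names and signatures are mine, not taken from Nutch or any other crawler:

    from dataclasses import dataclass
    from typing import Iterable, Optional, Protocol

    @dataclass
    class Page:
        url: str
        status: int
        body: bytes

    class Fetcher(Protocol):             # HTTP request and response module
        def fetch(self, url: str) -> Page: ...

    class LinkExtractor(Protocol):       # regex- or html-parser-based
        def extract(self, page: Page) -> Iterable[str]: ...

    class Fetchlist(Protocol):           # download queue + in-flight urls
        def add(self, url: str) -> None: ...
        def next_url(self) -> Optional[str]: ...

    class UrlDb(Protocol):               # database of downloaded urls
        def seen(self, url: str) -> bool: ...
        def mark(self, url: str) -> None: ...

    class PageStore(Protocol):           # database of downloaded pages
        def save(self, page: Page) -> None: ...

    class LinkGraph(Protocol):           # link graph/database
        def add_links(self, src: str, dests: Iterable[str]) -> None: ...

    class UrlNormalizer(Protocol):
        def normalize(self, url: str) -> str: ...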

That's all for now. In a follow-up post, I'll walk through the typical application flow of a crawler, and the interactions between the modules/data structures.