I've had the good fortune to get paid to work on OC (Our Crawler). The features I've been developing are aimed at focused crawling.

Specifically:

  1. Ranking content by relevance to a supplied query and crawling the most relevant links first, with the option of specifying a score threshold (there's a rough sketch of the idea after this list)
  2. Checkpointing the crawl output (which is a Nutch segment) at time intervals, e.g. every 60 minutes. This is insurance against hung crawls, or cases where the crawler hits a bot-trap and can't get out.
  3. Time-limited "perpetual crawling", where the crawler keeps going until a time limit is reached, then stops all threads and exits gracefully (also sketched below)
  4. Introducing various fetchlist filters that reduce the chances of getting lost in bot-traps, such as not going more than x levels deep within a host and rejecting URLs that keep piling on query parameters (sketched below as well)
  5. MySQL and BDB-backed persistence.
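
I can't paste OC's source here, but the idea behind the first item is roughly a best-first frontier: outgoing links inherit a relevance score against the focus query, the highest-scoring link gets fetched next, and anything under the threshold never makes it into the queue. The sketch below is mine, not OC's, and all the names are invented:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/**
 * Toy best-first frontier (not OC's actual code): candidate links carry a
 * relevance score, the highest-scoring link is fetched next, and anything
 * below the threshold never enters the queue at all.
 */
public class RelevanceFrontier {

    static class Candidate {
        final String url;
        final double score; // relevance against the focus query
        Candidate(String url, double score) { this.url = url; this.score = score; }
    }

    private final PriorityQueue<Candidate> queue =
        new PriorityQueue<Candidate>(11, new Comparator<Candidate>() {
            public int compare(Candidate a, Candidate b) {
                return Double.compare(b.score, a.score); // highest score first
            }
        });
    private final double threshold;

    public RelevanceFrontier(double threshold) { this.threshold = threshold; }

    public void offer(String url, double score) {
        if (score >= threshold) {   // score threshold: weak links are dropped outright
            queue.add(new Candidate(url, score));
        }
    }

    public Candidate next() { return queue.poll(); }

    public static void main(String[] args) {
        RelevanceFrontier frontier = new RelevanceFrontier(0.3);
        frontier.offer("http://example.com/on-topic", 0.9);
        frontier.offer("http://example.com/maybe", 0.4);
        frontier.offer("http://example.com/off-topic", 0.1); // below threshold, dropped
        for (Candidate c; (c = frontier.next()) != null; ) {
            System.out.println(c.score + "  " + c.url);
        }
    }
}
```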
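
The time-limited crawling in item 3 boils down to a deadline that each fetcher thread checks between units of work; once the deadline passes, the threads drain and the process exits. Again, a toy illustration with invented names rather than the real implementation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Toy illustration (not OC's code) of a time-limited crawl: worker threads
 * keep fetching until the deadline passes, then finish their current unit
 * of work and exit, letting the process shut down cleanly.
 */
public class TimeLimitedCrawl {
    public static void main(String[] args) throws InterruptedException {
        final long deadline = System.currentTimeMillis() + 5000L; // pretend 5-second limit
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            final int worker = i;
            pool.submit(new Runnable() {
                public void run() {
                    while (System.currentTimeMillis() < deadline) {
                        // fetch-and-parse would go here; we just sleep to simulate work
                        try { Thread.sleep(500); } catch (InterruptedException e) { return; }
                    }
                    System.out.println("worker " + worker + " stopping at deadline");
                }
            });
        }
        pool.shutdown();                              // no new work accepted
        pool.awaitTermination(1, TimeUnit.MINUTES);   // wait for threads to drain
        System.out.println("crawl exited gracefully");
    }
}
```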
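
And the fetchlist filters in item 4 are just cheap URL heuristics applied before a link ever reaches the fetchlist. In the rough sketch below, a flat cap on query-parameter count stands in for the "repeatedly increasing parameters" check, and everything else is invented for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Illustrative bot-trap heuristics (not OC's actual filters): reject URLs
 * nested more than maxDepth path segments deep within a host, or whose
 * query string carries an implausible number of parameters.
 */
public class TrapHeuristics {

    private final int maxDepth;        // e.g. 10 path segments
    private final int maxQueryParams;  // e.g. 8 key=value pairs

    public TrapHeuristics(int maxDepth, int maxQueryParams) {
        this.maxDepth = maxDepth;
        this.maxQueryParams = maxQueryParams;
    }

    /** Returns true if the URL passes both heuristics. */
    public boolean accept(String url) {
        try {
            URI uri = new URI(url);

            // Depth check: count non-empty path segments.
            String path = uri.getPath() == null ? "" : uri.getPath();
            int depth = 0;
            for (String segment : path.split("/")) {
                if (segment.length() > 0) depth++;
            }
            if (depth > maxDepth) return false;

            // Query-parameter check: count '&'-separated pairs.
            String query = uri.getQuery();
            if (query != null && query.split("&").length > maxQueryParams) {
                return false;
            }
            return true;
        } catch (URISyntaxException e) {
            return false; // unparsable URLs are rejected outright
        }
    }

    public static void main(String[] args) {
        TrapHeuristics filter = new TrapHeuristics(10, 8);
        System.out.println(filter.accept("http://example.com/a/b/c?x=1&y=2"));   // true
        System.out.println(filter.accept(
            "http://example.com/cal?y=1&y=2&y=3&y=4&y=5&y=6&y=7&y=8&y=9"));      // false
    }
}
```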

In addition, some refactoring has taken place that makes it easier to run crawls via the API (as opposed to the command line or Spring). Spring's role has also been relegated from obligatory to optional (but sweet to have, all the same).

We're still discussing the details of whether all of the code can be open-sourced, though. I'm keeping my fingers crossed.

Next on the plate is support for distributed crawling. Will OC use Nutch's Map-Reduce? That remains to be seen…