I've been working on Nutch lately for a client, and it's good fun feeling my way around such an ambitious project. It's still rather immature: the code is stable and there are no major bugs, but the API isn't yet developer-friendly, in that it's difficult to extend many classes without patching Nutch directly.

It's interesting to see Doug Cutting put Lucene through its paces in Nutch. It gives an indication of how Lucene can be made to do some interesting things. I think Nutch is the best available case study for how to power-use Lucene, including distributed indexing and searching.

I would love to see:

  1. the crawling part of Nutch extracted into a separate library (I requested this on the mailing list, but got no response)
  2. easier-to-use console apps for manipulating the webdb
  3. …TBD