I had a chat with Mike from Atlassian recently, and I've come to the conclusion that the future of OC lies in being a crawler API, much as Lucene is an API for search. I suppose it will sit somewhere between Nutch (a full-blown whole-web crawler) and Commons HttpClient.

Some directions I will explore include:

  • Introducing checkpointing to recover from hung/crashed crawls
  • A GUI app (probably thinlet-based) for monitoring crawl status
  • Authentication databases for sites (username/password or cookies)
  • Alternatives to Nutch's SegmentFile

I expect to have some free time on my hands to resume work on OC in the coming months.

Update 270106:
Well, checkpointing at the segment level is done, at least. So I can't yet recover from a failed crawl, but at least I don't lose everything. 🙂 It's a timer-based checkpointer: every 60 minutes it closes the segment writers and opens new ones for writing.
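The rotation logic can be sketched roughly like this; `SegmentWriter` and all the class/method names here are hypothetical stand-ins for OC's actual writer, not its real API:

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of a timer-based segment checkpointer. The interval and
// the close-then-reopen behaviour match the description above; everything
// else is illustrative.
class SegmentCheckpointer {
    // Stand-in for OC's real segment writer.
    interface SegmentWriter {
        void close();
    }

    private final Timer timer = new Timer(true); // daemon timer thread
    private final AtomicInteger segmentId = new AtomicInteger(0);
    private volatile SegmentWriter current;

    SegmentCheckpointer(long intervalMillis) {
        current = openWriter(segmentId.get());
        timer.scheduleAtFixedRate(new TimerTask() {
            public void run() { rotate(); }
        }, intervalMillis, intervalMillis);
    }

    // Close the live writer so its segment becomes a durable checkpoint,
    // then start a fresh segment for subsequent fetches.
    synchronized void rotate() {
        current.close();
        current = openWriter(segmentId.incrementAndGet());
    }

    private SegmentWriter openWriter(int id) {
        return new SegmentWriter() {
            public void close() { /* flush and close the segment's files */ }
        };
    }

    int segmentsWritten() { return segmentId.get(); }
}
```

With a 60-minute interval, the worst case after a crash is losing the last partial hour of fetches, which is exactly the trade-off described above.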

Database mining would be very cool, with support for paging and so on, though it feels somewhat peripheral to OC. If we included some kind of regex/parsing framework for screen-scraping, we would already have maybe 85% of what we need to build a vertical search portal.
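To make the screen-scraping idea concrete, here is a rough sketch of the kind of regex extraction such a framework might offer; the pattern, the markup it matches, and the class name are purely illustrative, not anything in OC:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy example of regex-based screen-scraping: pull (name, price) pairs out
// of fetched HTML. A real framework would let each site register patterns
// like this one.
class PriceScraper {
    // Matches markup of the form:
    //   <td class="name">Widget</td><td class="price">$9.99</td>
    private static final Pattern ROW = Pattern.compile(
        "<td class=\"name\">([^<]+)</td>\\s*<td class=\"price\">\\$([0-9.]+)</td>");

    // Returns one {name, price} pair per matched row.
    static List<String[]> extract(String html) {
        List<String[]> rows = new ArrayList<String[]>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2) });
        }
        return rows;
    }
}
```

Feed the crawler's fetched pages through a per-site set of such patterns and you have the extraction half of a vertical search portal; the crawl scheduling and indexing halves OC already covers.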

There is already an alternative to SegmentFile for online DB updates, and it's BerkeleyDB. A simple BerkeleyDB persister has already been written... but, but, but... I don't like that it's GPL (albeit a looser variant of the GPL). So, one day when I'm feeling particularly inspired, I'll hack up my own BTree and external HashMap implementations.
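As a rough sketch of what the "external HashMap" half might look like: keys and value offsets kept in memory, values appended to a file on disk. This is a toy under my own assumptions (a real one would also persist the index and compact dead records); it is not OC code and not the BerkeleyDB persister:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Toy "external HashMap": the index (key -> {offset, length}) lives in
// memory, while values live in an append-only data file on disk.
class ExternalHashMap implements AutoCloseable {
    private final RandomAccessFile data;
    private final Map<String, long[]> index = new HashMap<String, long[]>();

    ExternalHashMap(File file) throws IOException {
        data = new RandomAccessFile(file, "rw");
    }

    // Append the value to the data file and record where it landed.
    // Overwriting a key just orphans the old record (no compaction here).
    void put(String key, byte[] value) throws IOException {
        long offset = data.length();
        data.seek(offset);
        data.write(value);
        index.put(key, new long[] { offset, value.length });
    }

    // Seek to the recorded offset and read the value back, or null if absent.
    byte[] get(String key) throws IOException {
        long[] loc = index.get(key);
        if (loc == null) return null;
        byte[] buf = new byte[(int) loc[1]];
        data.seek(loc[0]);
        data.readFully(buf);
        return buf;
    }

    public void close() throws IOException {
        data.close();
    }
}
```

The BTree would cover the sorted/range-scan cases that SegmentFile handles; this hash layout only covers random key lookups.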

Now, a GUI for crawling would be totally sweet. In fact, that's probably what I'm going to work on next.