I get pinged all the time by people who tell me they want to build a vertical search engine with Nutch. The part I can't figure out, though, is why Nutch?

What's vertical anyway?

So let's start from basics. Vertical search engines typically fall into 2 categories:

  1. Whole-web search engines which selectively crawl the Internet for webpages related to a certain topic/industry/etc.
  2. Aggregation-type search engines which mine other websites and databases, aggregating data and repackaging it into a format which is easier to search.

Now, imagine a biotech company comes to me to develop a search engine for everything related to biotechnology and genetics. I'd have to crawl as many websites as I can, and only include the ones related to biotechnology in the search index.

How would I implement the crawler? Probably use Nutch for the crawling and modify it to only extract links from a page if the page contents are relevant to biotechnology. I'd probably need to write some kind of relevancy scoring function which uses a mixture of keywords, ontology and some kind of similarity detection based on sites we know a priori to be relevant.
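
To make the relevancy idea a bit more concrete, here's a minimal sketch of just the keyword part of such a scoring function, in plain Java. The class name, the keyword list and the threshold are all made up for illustration; none of this is Nutch API, and a real scorer would still need the ontology and similarity pieces layered on top.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical keyword-based relevance check; not part of Nutch.
class KeywordRelevance {
    // Illustrative keyword list (an assumption, not a real ontology).
    static final Set<String> KEYWORDS =
            Set.of("genome", "protein", "dna", "biotech", "enzyme");

    // Fraction of the words in the text that are known keywords (0.0 to 1.0).
    static double score(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z]+");
        int total = 0;
        int hits = 0;
        for (String w : words) {
            if (w.isEmpty()) continue; // split() can yield empty strings
            total++;
            if (KEYWORDS.contains(w)) hits++;
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    // Only extract links from a page that scores at or above the threshold.
    static boolean shouldFollowLinks(String text, double threshold) {
        return score(text) >= threshold;
    }
}
```

The crawler would call something like shouldFollowLinks(pageText, 0.05) before adding a page's outlinks to the crawl queue; the right threshold is something you'd have to tune against pages you already know are relevant.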

Now, second scenario. Imagine someone comes to me and wants to develop a job search engine for a certain country. This would involve indexing all jobs posted on the 4 major job websites, refreshing this database on a daily basis, checking for new jobs, deleting expired jobs, etc.

How would I implement this second crawler? Use Nutch? No way! Ahhhh, now we're getting to the crux of this post...

The ubiquity of Lucene … and therefore Nutch

Nutch is one of two open-source Java crawlers out there, the other being Heritrix from the good guys at the Internet Archive. It has ridden on the coattails of Lucene, the default choice for a full-text search API. Everyone who wants to build a vertical search engine in Java these days knows they're going to use Lucene as the search API, and naturally looks to Nutch for the crawling side of things. And that's when their project runs into a brick wall…

To Nutch or not to Nutch

Nutch (together with Hadoop) is a very, very cool project with ambitious and praiseworthy goals. They're really trying to build an open-source version of Google (not sure if that's actually the explicitly declared aim).

Before jumping into any library or framework, you want to be sure you know what needs to be accomplished. I think this is the step many people skip: they have no idea what crawling is all about, so they try to learn what crawling is by observing what a crawler does. Enter Nutch.

The trouble is, observing/using Nutch isn't necessarily the best way to learn about crawling. The best way to learn about crawling is to build a simple crawler.

If you sit down and think about what a 4 job-site crawler really needs to do, it's not difficult to see that its functionality is modest and humble; in fact, I can write its algorithm out here:

for each site:
  if there is a way to list all jobs in the site, then
    page through this list, extracting job detail urls to the detail url database
  else if there are browsable categories like industry or geographical location, then
    page through these categories, extracting job detail urls to the detail url database
  for each url in the detail url database:
    download the url
    extract data into a database table according to predefined regex patterns

Won't be difficult to hack up something quick to do this, especially with the help of Commons HttpClient. You'll probably also want to make this app multi-threaded.
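
Here's roughly what the algorithm above might look like in Java for a single site. It's a sketch, not a finished crawler: fetching is hidden behind a plain function so you can plug in Commons HttpClient (or anything else), and the class name, the regex patterns and the idea of a "next page" link are my own assumptions for illustration. Error handling is left out entirely.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-site crawler following the algorithm above.
class JobSiteCrawler {
    private final Function<String, String> fetcher; // url -> HTML; plug your HTTP client in here
    private final Pattern detailLink;               // group 1 = job detail url
    private final Pattern nextPageLink;             // group 1 = url of the next listing page
    private final Map<String, Pattern> fields;      // field name -> pattern, group 1 = value

    JobSiteCrawler(Function<String, String> fetcher, Pattern detailLink,
                   Pattern nextPageLink, Map<String, Pattern> fields) {
        this.fetcher = fetcher;
        this.detailLink = detailLink;
        this.nextPageLink = nextPageLink;
        this.fields = fields;
    }

    // Page through a listing, collecting detail urls (the "detail url database").
    Set<String> collectDetailUrls(String firstListingUrl) {
        Set<String> detailUrls = new LinkedHashSet<>();
        Set<String> visited = new HashSet<>(); // guards against paging loops
        String url = firstListingUrl;
        while (url != null && visited.add(url)) {
            String html = fetcher.apply(url);
            Matcher m = detailLink.matcher(html);
            while (m.find()) detailUrls.add(m.group(1));
            Matcher next = nextPageLink.matcher(html);
            url = next.find() ? next.group(1) : null;
        }
        return detailUrls;
    }

    // Download each detail page and extract one row of field values per job.
    List<Map<String, String>> extractJobs(Set<String> detailUrls) {
        List<Map<String, String>> rows = new ArrayList<>();
        for (String url : detailUrls) {
            String html = fetcher.apply(url);
            Map<String, String> row = new LinkedHashMap<>();
            row.put("url", url);
            for (Map.Entry<String, Pattern> f : fields.entrySet()) {
                Matcher m = f.getValue().matcher(html);
                if (m.find()) row.put(f.getKey(), m.group(1).trim());
            }
            rows.add(row);
        }
        return rows;
    }
}
```

For a site with browsable categories instead of one big listing, you'd call collectDetailUrls once per category start page and merge the results. The rows returned by extractJobs are what you'd insert into your database table.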

Other things you'll want to consider are how many simultaneous threads to hit a server with, whether to save the HTML content of pages or just keep the extracted data, how to deal with errors, etc.
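
On the threading question, one simple way to cap how hard you hit a single server is a fixed-size thread pool per site. The sketch below uses only java.util.concurrent; the class and method names are invented, the fetcher is again just a function standing in for your HTTP client, and treating a failed fetch as null is one possible error policy, not the only sensible one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: fetch many urls from ONE server with a capped thread count.
class PoliteFetcher {
    // Returns one entry per url, in the same order; null marks a failed fetch.
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher,
                                 int maxThreads) {
        // At most maxThreads requests are ever in flight against this server.
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    results.add(null); // this url failed; carry on with the rest
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add(null);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With 4 job sites you'd run one such pool per site, so that a slow site doesn't hold up the others and no single server sees more than maxThreads connections from you.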

All in all, I think you'll find that it's not altogether overwhelming, and there's actually a lot to be said for the complete control you have over the crawling and post-crawl extraction processes. Compare this to Nutch, where you'll need to fiddle with various configuration files (nutch-site.xml, urlfilters, etc.); driving it from your own app through an API is difficult; you'll have to work with its various file I/O structures to reach the content (SegmentFile, MapFile, etc.); various issues may prevent all urls from being fetched (retry.max being a common one); and if you want custom crawl logic, you'll have to patch or fork the codebase (ugh!).

The other thing that Nutch offers is an out-of-the-box search solution, but I personally have never found a compelling reason to use it: it's difficult to add custom fields, adding OR phrase capability requires patching the codebase, etc. In fact, I find it much, much simpler to come up with my own SearchServlet.

Even if you decide against a homegrown solution and want to go with Nutch, there's one other thing you need to know before jumping in.

To map-reduce, or not?

From Nutch 0.7 to Nutch 0.8, there was a pretty big jump in code complexity with the inclusion of the map-reduce infrastructure. Map-reduce subsequently got factored out, together with some of the core distributed I/O classes, into Hadoop.

For a simple example to illustrate my point, just compare the core crawler class, org.apache.nutch.fetcher.Fetcher, in the 0.7 branch with the one in the current 0.9 branch.

The 0.7 Fetcher is simple and easy to understand. I can't say the same of the 0.9 Fetcher. Even after having worked a bit with the 0.9 fetcher and map-reduce, I still find myself having to do mental gymnastics to figure out what's going on. BUT THAT'S OK, because writing massively distributable, scalable yet reliable applications is very, very hard, and map-reduce makes this possible and comparatively easy. The question to ask, though, is: does your search engine project to crawl and search those 4 job sites fall into this category? If not, you'd want to seriously consider not using the latest 0.8x release of Nutch, and lean toward 0.7 instead. Of course, the biggest problem with this is that 0.7 is not being actively maintained (to my knowledge).


Perhaps someone will read this post and think I'm slighting Nutch, so let me make this really clear: _for what it's designed to do_, that is, whole-web crawling, Nutch does a good job; if all that's needed is to page through search result pages and extract data into a database, Nutch is simply overkill.