Nutch Consultant

What is Nutch

According to the Nutch website, Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics such as a crawler, a link-graph database, and parsers for HTML and other document formats.

How I can help

My involvement with clients as a Nutch consultant generally revolves around:

  • Building a vertical search app
  • Implementing a custom crawl solution
  • Mentoring and training an in-house team of developers tasked with using Nutch to run a crawler

Custom crawl solutions

Some clients come to me with a specific crawling requirement which they're not sure how to implement.

The app to be built is typically neither vertical search nor whole-web search, but an interesting amalgam of the two.

I really enjoy working on projects like these, and my crawling and search experience usually makes me well suited to such "odd jobs".

The final solution usually ends up being part Nutch, part custom app.

Training

I conduct both onsite and offsite training for Nutch.

These training sessions typically run as either one 4-hour session or two 2-hour sessions.

Topics covered include:

  1. Crawling fundamentals
  2. Map and Reduce in Nutch
  3. How Nutch uses Lucene
  4. Deciphering Nutch segments
  5. Digging into Nutch internals

At the end of a training session, attendees should be able to understand:

  1. Basic crawling concepts and challenges
  2. How to work with Nutch integration points (i.e., plugins and filters)
  3. Where to look in the Nutch codebase if they want to modify behavior