Thoughts on Lucene, Solr, Nutch and vertical search 

Lucene / Solr / Nutch

Archived Posts from this Category

OC and focused crawling

Posted by Kelvin on 26 Feb 2006 | Tagged as: Lucene / Solr / Nutch

I’ve had the good fortune to get paid to work on OC (Our Crawler). The features I’ve been developing are all aimed at focused crawling.

Specifically:

  1. Ranking content by relevance to a supplied query and crawling the most relevant links first, with the possibility of specifying a score threshold
  2. Checkpointing the crawl output (which is a Nutch segment) at time intervals, e.g. every 60 minutes. This is insurance against hung crawls, or against the crawler hitting a bot-trap it can’t exit.
  3. Time-limited “perpetual crawling”, where the crawler keeps going until a time limit is reached, at which point it stops all threads and exits gracefully.
  4. Introducing various fetchlist filters which reduce the chances of getting lost in bot-traps, such as refusing to go more than x levels deep within a host, and rejecting URLs which repeatedly increase the number of query parameters.
  5. MySQL and BDB-backed persistence.
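
To make items 1 and 4 a bit more concrete, here is a rough sketch of how a relevance-ordered frontier with a score threshold and two bot-trap filters might look. It is illustrative only and not OC's actual code: the class `FocusedFrontier`, the `Candidate` type, and the parameters `minScore`, `maxDepth` and `maxParamStreak` are all hypothetical names made up for this example, and the relevance score is assumed to be computed elsewhere.

```java
import java.net.URI;
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical focused-crawl frontier; none of these names come from OC or Nutch. */
class FocusedFrontier {

    /** A URL waiting to be fetched, plus the bookkeeping the filters need. */
    static final class Candidate {
        public final String url;
        public final double score;     // relevance to the query (assumed to be computed elsewhere)
        public final int depth;        // link hops from the seed
        public final int paramStreak;  // consecutive hops on which the query-parameter count grew

        Candidate(String url, double score, int depth, int paramStreak) {
            this.url = url;
            this.score = score;
            this.depth = depth;
            this.paramStreak = paramStreak;
        }
    }

    // Highest score first, so the most relevant links are crawled first.
    private final PriorityQueue<Candidate> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Candidate c) -> c.score).reversed());
    private final double minScore;
    private final int maxDepth;
    private final int maxParamStreak;

    FocusedFrontier(double minScore, int maxDepth, int maxParamStreak) {
        this.minScore = minScore;
        this.maxDepth = maxDepth;
        this.maxParamStreak = maxParamStreak;
    }

    /** Counts the '&'-separated entries in the query string (throws on malformed URLs). */
    static int countParams(String url) {
        String query = URI.create(url).getRawQuery();
        if (query == null || query.isEmpty()) {
            return 0;
        }
        return query.split("&").length;
    }

    /** Adds a seed URL at depth 0; returns false if it is filtered out. */
    boolean addSeed(String url, double score) {
        return offer(new Candidate(url, score, 0, 0));
    }

    /** Adds a link found on the parent page; returns false if a filter rejects it. */
    boolean addLink(Candidate parent, String url, double score) {
        boolean grew = countParams(url) > countParams(parent.url);
        int streak = grew ? parent.paramStreak + 1 : 0;
        return offer(new Candidate(url, score, parent.depth + 1, streak));
    }

    private boolean offer(Candidate c) {
        if (c.score < minScore) return false;             // below the score threshold
        if (c.depth > maxDepth) return false;             // too many levels deep
        if (c.paramStreak > maxParamStreak) return false; // query parameters keep growing
        queue.add(c);
        return true;
    }

    /** Returns the most relevant pending URL, or null when the frontier is empty. */
    Candidate next() {
        return queue.poll();
    }
}
```

A fetcher thread would call `next()` in a loop, fetch and score the page, then call `addLink` for each outlink. For simplicity the sketch counts depth from the seed rather than per host, and it treats "repeatedly increase" as a streak of consecutive hops, which is only one possible reading of that filter.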

In addition, some refactoring has made it easier to run crawls via the API (as opposed to the command line or Spring). Spring’s role has also gone from obligatory to optional (but sweet to have, all the same).

We’re still discussing the details of whether all of the code can be open-sourced, though. I’m keeping my fingers crossed.

Next on the plate is support for distributed crawling. Will OC use Nutch’s MapReduce? That remains to be seen…

The next few months for OC

Posted by Kelvin on 28 Jan 2006 | Tagged as: Lucene / Solr / Nutch

Crawling Basics

Posted by Kelvin on 27 Oct 2005 | Tagged as: Lucene / Solr / Nutch, work

Practical introduction to Nutch MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: Lucene / Solr / Nutch, work

Hello World for MapReduce

Posted by Kelvin on 28 Sep 2005 | Tagged as: Lucene / Solr / Nutch, work

OC and Nutch MapReduce

Posted by Kelvin on 15 Sep 2005 | Tagged as: Lucene / Solr / Nutch, programming, work

Inside Our Crawler

Posted by Kelvin on 25 Aug 2005 | Tagged as: Lucene / Solr / Nutch, programming, work

Our Crawler Todo List

Posted by Kelvin on 25 Aug 2005 | Tagged as: Lucene / Solr / Nutch, programming

Limitations of OC

Posted by Kelvin on 19 Aug 2005 | Tagged as: Lucene / Solr / Nutch, programming, work

Reflections on modifying the Nutch crawler

Posted by Kelvin on 16 Aug 2005 | Tagged as: Lucene / Solr / Nutch, programming, work
