Blog 

Lucene / Nutch

Archived Posts from this Category

Using Hadoop IPC/RPC for distributed applications

Posted by Kelvin on 02 Jun 2008 | Tagged as: Lucene / Nutch, programming

Hadoop is growing to be a pretty large framework - release 0.17.0 has 483 classes!

Previously, I’d written about Hadoop SequenceFile. SequenceFile is part of the org.apache.hadoop.io package, the other notable useful classes in that package being ArrayFile and MapFile which are persistent array and dictionary data structures respectively.

About Hadoop IPC

Here, I’m going to introduce the Hadoop IPC, or Inter-Process Communication subsystem. Hadoop IPC is the underlying mechanism is used whenever there is a need for out-of-process applications to communicate with one another.

Hadoop IPC
1. uses binary serialization using java.io.DataOutputStream and java.io.DataInputStream, unlike SOAP or XML-RPC.
2. is a simple low-fuss RPC mechanism.
3. is unicast does not support multi-cast.

Why use Hadoop IPC over RMI or java.io.Serialization? Here’s what Doug has to say:

Why didn’t I use Serialization when we first started Hadoop? Because it looked big-and-hairy and I thought we needed something lean-and-mean, where we had precise control over exactly how objects are written and read, since that is central to Hadoop. With Serialization you can get some control, but you have to fight for it.

The logic for not using RMI was similar. Effective, high-performance inter-process communications are critical to Hadoop. I felt like we’d need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those.

Code!!!

I’m going to jump straight into some code examples to illustrate how easy Hadoop IPC can be.

Essentially, all unicast RPC calls involve a client and a server.

To create a server,

Configuration conf = new Configuration();
Server server = RPC.getServer(this, "localhost", 16000, conf);  // start a server on localhost:16000
server.start();

To create a client,

Configuration conf = new Configuration();
InetSocketAddress addr = new InetSocketAddress("localhost", 16000);  // the server's inetsocketaddress
ClientProtocol client = (ClientProtocol) RPC.waitForProxy(ClientProtocol.class,
    ClientProtocol.versionID, addr, conf);

In this example, ClientProtocol is an interface implemented by the server class. ClientProtocol.java looks like this:

interface ClientProtocol extends org.apache.hadoop.ipc.VersionedProtocol {
  public static final long versionID = 1L;

  HeartbeatResponse heartbeat();
}

ClientProtocol defines a single method, heartbeat() which returns a HeartbeatResponse object. heartbeat() is the remote method called by the client which periodically lets the server know that it (the client) is still alive. The server then returns a HeartbeatResponse object which provides the client with some information.

Here’s what HeartbeatResponse.java looks like:

public class HeartbeatResponse implements org.apache.hadoop.io.Writable {
  String status;

  public void write(DataOutput out) throws IOException {
    UTF8.writeString(out, status);
  }

  public void readFields(DataInput in) throws IOException {
    this.status = UTF8.readString(in);
  }
}

Summary

So, to summarize, you’ll need
1. A server which implements an interface (which itself extends VersionedProtocol)
2. 1 or more clients which call the interface methods.
3. Any method arguments or objects returned by methods must implement org.apache.hadoop.io.Writable

Is Nutch appropriate for aggregation-type vertical search?

Posted by Kelvin on 24 Sep 2007 | Tagged as: Lucene / Nutch, programming

I get pinged all the time by people who tell me they want to build a vertical search engine with Nutch. The part I can’t figure out, though, is why Nutch?

What’s vertical anyway?

So let’s start from basics. Vertical search engines typically fall into 2 categories:

  1. Whole-web search engines which selectively crawl the Internet for webpages related to a certain topic/industry/etc.
  2. Aggregation-type search engines which mine other websites and databases, aggregating data and repackaging it into a format which is easier to search.

Now, imagine a biotech company comes to me to develop a search engine for everything related to biotechnology and genetics. You’d have to crawl as many websites as you can, and only include the ones related to biotechnology in the search index.

How would I implement the crawler? Probably use Nutch for the crawling and modify it to only extract links from a page if the page contents are relevant to biotechnology. I’d probably need to write some kind of relevancy scoring function which uses a mixture of keywords, ontology and some kind of similarity detection based on sites we know a priori to be relevant.

Now, second scenario. Imagine someone comes to me and want to develop a job search engine for a certain country. This would involve indexing all jobs posted in the 4 major job websites, refreshing this database on a daily basis, checking for new jobs, deleting expired jobs etc.

How would I implement this second crawler? Use Nutch? No way! Ahhhh, now we’re getting to the crux of this post..

The ubiquity of Lucene … and therefore Nutch

Nutch is one of two open-source Java crawlers out there, the other being Heritrix from the good guys at the Internet Archive. Its rode on Lucene as the default choice for full-text search API. Everyone who wants to build a vertical search engine in Java these days knows they’re going to use Lucene as the search API, and naturally look to Nutch for the crawling side of things. And that’s when their project runs into a brick wall…

To Nutch or not to Nutch

Nutch (and Hadoop) is a very very cool project with ambitious and praiseworthy goals. They’re really trying to build an open-source version of Google (not sure if that actually is the explicitly declared aims).

Before jumping into any library or framework, you want to be sure you know what needs to be accomplished. I think this is the step many people skip: they have no idea what crawling is all about, so they try to learn what crawling is by observing what a crawler does. Enter Nutch.

The trouble is, observing/using Nutch isn’t necessarily the best way to learn about crawling. The best way to learn about crawling is to build a simple crawler.

In fact, if you sit down and think what a 4 job-site crawler really needs to do, its not difficult to see that its functionality is modest and humble - in fact, I can write its algorithm out here:


for each site:
  if there is a way to list all jobs in the site, then
    page through this list, extracting job detail urls to the detail url database
  else if exists browseable categories like industry or geographical location, then
    page through these categories, extracting job detail urls to the detail url database
  else
    continue
  for each url in the detail url database:
    download the url
    extract data into a database table according to predefined regex patterns

Won’t be difficult to hack up something quick to do this, especially with the help of Commons HttpClient. You’ll probably also want to make this app multi-threaded.

Other things you’ll want to consider, is how many simultaneous threads to hit a server with, if you want to save the HTML content of pages vs just keeping the extracted data, how to deal with errors, etc.

All in all, I think you’ll find that its not altogether overwhelming, and there’s actually alot to be said for the complete control you have over the crawling and post-crawl extraction processes. Compare this to Nutch, where you’ll need to fiddle with various configuration files (nutch-site.xml, urlfilters, etc), where calling apps from an API perspective is difficult, you’ll have to work with the various file I/O structures to reach the content (SegmentFile, MapFile etc), various issues may prevent all urls from being fetched (retry.max being a common one), if you want custom crawl logic, you’ll have to patch/fork the codebase (ugh!) etc.

The other thing that Nutch offers is an out-of-box search solution, but I personally have never found a compelling reason to use it - its difficult to add custom fields, adding OR phrase capability requires patching codebase, etc. In fact, I find it much much simpler to come up with my own SearchServlet.

Even if you decide not to come up with a homegrown solution, and you want to go with Nutch. Well, here’s one other thing you need to know before jumping into Nutch.

To map-reduce, or not?

From Nutch 0.7 to Nutch 0.8, there was a pretty big jump in the code complexity with the inclusion of the map-reduce infrastructure. Map-reduce subsequently got factored out, together with some of the core distributed I/O classes into Hadoop.

For a simple example to illustrate my point, just take a look at the core crawler class, org.apache.nutch.fetcher.Fetcher, from the 0.7 branch, to the current 0.9 branch.

The 0.7 Fetcher is simple and easy to understand. I can’t say the same of the 0.9 Fetcher. Even after having worked abit with the 0.9 fetcher and map-reduce, I still find myself having to do mental gymnastics to figure out what’s going on. BUT THAT’S OK, because writing massively distributable, scaleable yet reliable applications is very very hard, and map-reduce makes this possible and comparatively easy. The question to ask though, is, does your search engine project to crawl and search those 4 job sites fall into this category? If not, you’d want to seriously consider against using the latest 0.8x release of Nutch, and tend to 0.7 instead. Of course, the biggest problem with this, is that 0.7 is not being actively maintained (to my knowledge).

Conclusion

Perhaps someone will read this post and think I’m slighting Nutch, so let me make this really clear: _for what its designed to do_, that is, whole-web crawling, Nutch does a good job of it; if what is needed is to page through search result pages and extract data into a database, Nutch is simply overkill.

Exploring Hadoop SequenceFile

Posted by Kelvin on 03 Jan 2007 | Tagged as: Lucene / Nutch

Hadoop’s SequenceFile is at the heart of the Hadoop io package. Both MapFile (disk-backed Map) and ArrayFile (disk-backed Array) are built on top of SequenceFile.

So what exactly is SequenceFile? Its class javadoc tells us: Support for flat files of binary key/value pairs.- not very helpful.

Let’s dig through the code and find out more:

  1. supports key/value pair, where key and value are any arbitrary classes which implement org.apache.hadoop.io.Writable
  2. contains 3 inner classes: Reader, Writer and Sorter
  3. Sorts are performed as external merge-sorts
  4. designed to be modified in batch, i.e. does NOT support appends or incremental updates. Modifications/appends involving creating a new SequenceFile, copying from the old->new, adding/changing values along the way
  5. compresses values using java.util.zip.Deflater after version 3
  6. from code comments: Inserts a globally unique 16-byte value every few entries, so that one can seek into the middle of a file and then synchronize with record starts and ends by scanning for this value

You might also be interested in a recent post on using Hadoop IPC/RPC for distributed applications.

PHP + Lucene integration

Posted by Kelvin on 01 Jan 2007 | Tagged as: Lucene / Nutch, programming

I’ve had very positive experiences integrating PHP front-end with a Lucene back-end, not with any fancy Java-in-PHP wizardry or Zend Lucene, but plain-old JSON-over-REST.

I’ve done some simple load tests, and it clearly outperforms Zend Lucene (though I don’t can’t offer you any numbers to back this claim up). Zend Lucene seems to suck bad at large (1GB+) indexes.

In terms of implementation, a servlet takes a query parameter, performs the search, then returns the results as JSON objects. There is a well defined API, and the solution offers the full expressiveness of Lucene query mechanism.

With a little creativity, you can even perform stuff like facet counting (see the left sidebar).

By abstracting the search presentation layer (JSON in this case), it would be a matter of minutes to add a XML or HTML display on top of it.

A simple API-friendly crawler

Posted by Kelvin on 01 Dec 2006 | Tagged as: Lucene / Nutch, programming

Alright. I know I’ve blogged about this before. Well, I’m revisiting it again.

My sense is that there’s a real need for a simple crawler which is easy to use as an API and doesn’t attempt to be everything to everyone.

Yes, Nutch is cool, but I’m so tired of fiddling around with configuration files, the proprietary fileformats, and the filesystem-dependence of plugins. Also, crawl progress reporting is poor unless you’re intending to be parsing log files.

Here are some thoughts on what a simple crawler might look like:

Download all pages in a site


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.setScope(new HostScope(url));
    c.start();

Download all urls from a file (depth 1 crawl)


    SimpleCrawler c = new SimpleCrawler();
    c.setMaxConnectionsPerHost(5);
    c.setIntervalBetweenConsecutiveRequests(1000);
    c.addURLs(new File(file));
    c.setLinkExtractor(null);
    c.setOutput(new DirectoryPerDomain(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Page through a search results page via regex


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setLinkExtractor(new RegexLinkExtractor(regex));
    c.setOutput(new SaveToDisk(downloaddir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

Save to nutch segment for compatibility


    SimpleCrawler c = new SimpleCrawler();
    c.addURL(url);
    c.setOutput(new NutchSegmentOutput(segmentdir));
    c.setProgressListener(new StdOutProgressListener());
    c.start();

I’m basically trying to find the sweet-spot between Commons HttpClient, and a full-blown crawler app like Nutch.

Thoughts?

Search and crawling internship

Posted by Kelvin on 31 Oct 2006 | Tagged as: Lucene / Nutch, programming

I’m looking for a competent Java programmer who wants to get into the world of search engines and crawlers. The internship will involve a mixture of (some) training and (mostly) hands-on projects with real-world clients.

In particular, my area of expertise is in vertical search (real estate, jobs, classifieds), so more than likely, that will be the bulk of the exposure.

You’ll be working mostly with Lucene and Nutch, but more than just using them, you’ll be delving into their innards and pushing the limits of these libraries.

Upon graduation, if you’re interested, you’ll be invited to stay on and work alongside me on contracting projects, some of which might involve on-site work, perhaps overseas.

The background to this is that I’m basically tired of turning down consulting work because of my limited time, and want to start building a search/crawling consulting practice.

So if you’re interested, send your resume along to internship AT supermind.org, with a brief note on your background and interest in this field.

Nutch 0.8, Map & Reduce, here I come!

Posted by Kelvin on 09 Aug 2006 | Tagged as: Lucene / Nutch, programming

Finally taking the plunge to Nutch 0.8 after exclusively working with 0.7 for over a year (and something like 5 projects).

From initial experiences, it appears that using M&R does obfuscate the code somewhat for a developer who wants to build an app off the Nutch infrastructure instead of using it out-of-box. For example, trying to decipher what’s going on in org.apache.nutch.indexer.Indexer is pretty difficult, compared to its 0.7 counterpart (IndexSegment).

Some serious documentation needs to be done around the implementation details of M&R. I’ll keep posting about my learnings..

Lucene scoring for dummies

Posted by Kelvin on 08 Mar 2006 | Tagged as: Lucene / Nutch

The factors involved in Lucene’s scoring algorithm are as follows:

1. tf = term frequency in document = measure of how often a term appears in the document
2. idf = inverse document frequency = measure of how often the term appears across the index
3. coord = number of terms in the query that were found in the document
4. lengthNorm = measure of the importance of a term according to the total number of terms in the field
5. queryNorm = normalization factor so that queries can be compared
6. boost (index) = boost of the field at index-time
7. boost (query) = boost of the field at query-time

The implementation, implication and rationales of factors 1,2, 3 and 4 in DefaultSimilarity.java, which is what you get if you don’t explicitly specify a similarity, are:

note: the implication of these factors should be read as, “Everything else being equal, …

1. tf
Implementation: sqrt(freq)
Implication: the more frequent a term occurs in a document, the greater its score
Rationale: documents which contains more of a term are generally more relevant

2. idf
Implementation: log(numDocs/(docFreq+1)) + 1
Implication: the greater the occurrence of a term in different documents, the lower its score
Rationale: common terms are less important than uncommon ones

3. coord
Implementation: overlap / maxOverlap
Implication: of the terms in the query, a document that contains more terms will have a higher score
Rationale: self-explanatory

4. lengthNorm
Implementation: 1/sqrt(numTerms)
Implication: a term matched in fields with less terms have a higher score
Rationale: a term in a field with less terms is more important than one with more

queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights)

So, roughly speaking (quoting Mark Harwood from the mailing list),
* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good

The mathematical definition of the scoring can be found at http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html

Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance

Customizing scoring

Its easy to customize the scoring algorithm. Just subclass DefaultSimilarity and override the method you want to customize.

For example, if you want to ignore how common a term appears across the index,


Similarity sim = new DefaultSimilarity() {
  public float idf(int i, int i1) {
    return 1;
  }
}

and if you think for the title field, more terms is better


Similarity sim = new DefaultSimilarity() {
  public float lengthNorm(String field, int numTerms) {
    if(field.equals("title")) return (float) (0.1 * Math.log(numTerms));
    else return super.lengthNorm(field, numTerms);
  }
}

OC and focused crawling

Posted by Kelvin on 26 Feb 2006 | Tagged as: Lucene / Nutch

I’ve had the good fortune to get paid to work on OC (Our Crawler). Features I’ve been developing have been for focused crawling purposes.

Specifically:

  1. Ranking content by relevance to a supplied query and crawling the most relevant links first, with the possibility of specifying a score threshold
  2. Checkpointing the crawl output (which is a Nutch segment) at time intervals, e.g. every 60 minutes. This is insurance against hung crawls, or if the crawler hit a bot-trap and couldn’t exit.
  3. Time-limited “perpetual crawling” where the crawler would keep going until a time limit was reached, in which case it will stop all threads and exit gracefully.
  4. Introducing various fetchlist filters which reduce the chances of getting lost in bot-traps, such as don’t go further than x levels deep within a host, and reject URLs which repeatedly increase the number of query parameters.
  5. MySQL and BDB-backed persistence.

In addition, some refactoring has also taken place that makes it easier to run crawls via API (as opposed to command-line or Spring). The role of Spring has also been relegated from obligatory to optional (but sweet to have, all the same).

We’re still discussing the details of whether all of the code can be open-sourced, though. I’m keeping my fingers crossed.

Next on the plate is support for distributed crawling. Will OC use Nutch’s Map-Reduce? That remains to be seen…

The next few months for OC

Posted by Kelvin on 28 Jan 2006 | Tagged as: Lucene / Nutch

I had a chat with Mike from Atlassian recently, and have arrived at the conclusion that the future of OC lies in being a crawler API, much like what Lucene does for searching. I suppose it will lie somewhere between Nutch (full-blown whole-web crawler) and Commons HTTPClient.

Some directions I will explore include:

  • Introducing checkpointing to recover from hung/crashed crawls
  • A GUI app (probably thinlet-based) for monitoring crawl status
  • Authentication databases for sites (username/password or cookies)
  • Alternatives to Nutch’s SegmentFile

I expect to have some free time on my hands to resume work on OC in the coming months.

Update 270106:
Well, checkpointing at the segment level is done at least. So I can’t yet recover from a failed crawl, but at least I don’t lose everything. :-) Its a timer-based checkpointer, so it closes the segment writers every 60 minutes and opens a new one for writing.

Database mining would be very cool. With support for paging and stuff, though it feels like its somewhat peripheral to OC. If we include some kind of regex/parsing framework for screenscraping, we would already have like 85% of what we need to build a vertical search portal.

There is already an alternative to SegmentFile for online DB updates, and its BerkeleyDB. A simple BerkeleyDB persister has already been written.. but but but.. I don’t like that its GPL (albeit a looser version of GPL). So, one day when I’m feeling particularly inspired, I’ll hack up my own BTree and external HashMap implementation.

Now, a GUI for crawling would be totally sweet. In fact, that’s probably what I’m going to work on next.

Next Page »

07/04/08 | Kelvin Tan | Lucene Vertical Search Consultant