Supermind Search Consulting Blog 
Solr - Elasticsearch - Big Data

Posts about Lucene / Solr / Elasticsearch / Nutch

What's new in Solr 1.4.1

Posted by Kelvin on 20 Oct 2010 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Solr 1.4.1 is a bug-fix release. No new features.

Here's the list of bugs that were fixed.

* SOLR-1934: Upgrade to Apache Lucene 2.9.3 to obtain several bug
fixes from the previous 2.9.1. See the Lucene 2.9.3 release notes
for details. (hossman, Mark Miller)

* SOLR-1432: Make the new ValueSource.getValues(context,reader) delegate
to the original ValueSource.getValues(reader) so custom sources
will work. (yonik)

* SOLR-1572: FastLRUCache correctly implemented the LRU policy only
for the first 2B accesses. (yonik)

* SOLR-1595: StreamingUpdateSolrServer used the platform default character
set when streaming updates, rather than using UTF-8 as the HTTP headers
indicated, leading to an encoding mismatch. (hossman, yonik)

* SOLR-1660: CapitalizationFilter crashes if you use the maxWordCountOption
(Robert Muir via shalin)

* SOLR-1662: Added Javadocs in BufferedTokenStream and fixed incorrect cloning
in TestBufferedTokenStream (Robert Muir, Uwe Schindler via shalin)

* SOLR-1711: SolrJ – StreamingUpdateSolrServer had a race condition that
could halt the streaming of documents. The original patch to fix this
(never officially released) introduced another hanging bug due to
connections not being released. (Attila Babo, Erik Hetzner via yonik)

* SOLR-1748, SOLR-1747, SOLR-1746, SOLR-1745, SOLR-1744: Streams and Readers
retrieved from ContentStreams are not closed in various places, resulting
in file descriptor leaks.
(Christoff Brill, Mark Miller)

* SOLR-1580: Solr Configuration ignores 'mergeFactor' parameter, always
uses Lucene default. (Lance Norskog via Mark Miller)

* SOLR-1777: fieldTypes with sortMissingLast=true or sortMissingFirst=true can
result in incorrectly sorted results. (yonik)

* SOLR-1797: fix ConcurrentModificationException and potential memory
leaks in ResourceLoader. (yonik)

* SOLR-1798: Small memory leak (~100 bytes) in fastLRUCache for every
commit. (yonik)

* SOLR-1522: Show proper message if script tag is missing for DIH
ScriptTransformer (noble)

* SOLR-1538: Reordering of object allocations in ConcurrentLRUCache to eliminate
(an extremely small) potential for deadlock.
(gabriele renzi via hossman)

* SOLR-1558: QueryElevationComponent only works if the uniqueKey field is
implemented using StrField. In previous versions of Solr no warning or
error would be generated if you attempted to use QueryElevationComponent,
it would just fail in unexpected ways. This has been changed so that it
will fail with a clear error message on initialization. (hossman)

* SOLR-1563: Binary fields, including trie-based numeric fields, caused null
pointer exceptions in the luke request handler. (yonik)

* SOLR-1579: Fixes to XML escaping in stats.jsp
(David Bowen and hossman)

* SOLR-1582: copyField was ignored for BinaryField types (gsingers)

* SOLR-1596: A rollback operation followed by the shutdown of Solr
or the close of a core resulted in a warning:
"SEVERE: SolrIndexWriter was not closed prior to finalize()" although
there were no other consequences. (yonik)

* SOLR-1651: Fixed Incorrect dataimport handler package name in SolrResourceLoader
(Akshay Ukey via shalin)

* SOLR-1936: The JSON response format needed to escape unicode code point
U+2028 – 'LINE SEPARATOR' (Robert Hofstra, yonik)

* SOLR-1852: Fix WordDelimiterFilterFactory bug where position increments
were not being applied properly to subwords. (Peter Wolanin via Robert Muir)

* SOLR-1706: fixed WordDelimiterFilter for certain combinations of options
where it would output incorrect tokens. (Robert Muir, Chris Male)

* SOLR-1948: PatternTokenizerFactory should use parent's args (koji)

* SOLR-1870: Indexing documents using the 'javabin' format no longer
fails with a ClassCastException when SolrInputDocuments contain field
values which are Collections or other classes that implement
Iterable. (noble, hossman)

* SOLR-1769 Solr 1.4 Replication – Repeater throwing NullPointerException (noble)

How to write a custom Solr FunctionQuery

Posted by Kelvin on 03 Sep 2010 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

Solr FunctionQueries allow you to modify the ranking of a search query in Solr by applying functions to the results.

There's a list of the out-of-the-box FunctionQueries available here: http://wiki.apache.org/solr/FunctionQuery

In order to write a custom Solr FunctionQuery, you'll need to do 3 things:

1. Subclass org.apache.solr.search.ValueSourceParser. Here's a stub ValueSourceParser.

public class MyValueSourceParser extends ValueSourceParser {
  public void init(NamedList namedList) {
  }
 
  public ValueSource parse(FunctionQParser fqp) throws ParseException {
    return new MyValueSource();
  }
}

2. In solrconfig.xml, register your new ValueSourceParser directly under the <config> tag

<valueSourceParser name="myfunc" class="com.mycompany.MyValueSourceParser" />

3. Subclass org.apache.solr.search.ValueSource and instantiate it in your ValueSourceParser.parse() method.

Let's take a look at 2 ValueSource implementations to see what they do, starting with the simplest:

org.apache.solr.search.function.ConstValueSource

Example SolrQuerySyntax: _val_:1.5

It simply returns a float value.

public class ConstValueSource extends ValueSource {
  final float constant;
 
  public ConstValueSource(float constant) {
    this.constant = constant;
  }
 
  public DocValues getValues(Map context, IndexReader reader) throws IOException {
    return new DocValues() {
      public float floatVal(int doc) {
        return constant;
      }
      public int intVal(int doc) {
        return (int)floatVal(doc);
      }
      public long longVal(int doc) {
        return (long)floatVal(doc);
      }
      public double doubleVal(int doc) {
        return (double)floatVal(doc);
      }
      public String strVal(int doc) {
        return Float.toString(floatVal(doc));
      }
      public String toString(int doc) {
        return description();
      }
    };
  }
// commented out some boilerplate stuff
}

As you can see, the important method is DocValues getValues(Map context, IndexReader reader). The gist of the method is to return a DocValues object which returns a value for a given document id.

org.apache.solr.search.function.OrdFieldSource

ord(myfield) returns the ordinal of the indexed field value within the indexed list of terms for that field in lucene index order (lexicographically ordered by unicode value), starting at 1. In other words, for a given field, all values are ordered lexicographically; this function then returns the offset of a particular value in that ordering.

Example SolrQuerySyntax: _val_:"ord(myIndexedField)"

public class OrdFieldSource extends ValueSource {
  protected String field;
 
  public OrdFieldSource(String field) {
    this.field = field;
  }
  public DocValues getValues(Map context, IndexReader reader) throws IOException {
    return new StringIndexDocValues(this, reader, field) {
      protected String toTerm(String readableValue) {
        return readableValue;
      }
 
      public float floatVal(int doc) {
        return (float)order[doc];
      }
 
      public int intVal(int doc) {
        return order[doc];
      }
 
      public long longVal(int doc) {
        return (long)order[doc];
      }
 
      public double doubleVal(int doc) {
        return (double)order[doc];
      }
 
      public String strVal(int doc) {
        // the string value of the ordinal, not the string itself
        return Integer.toString(order[doc]);
      }
 
      public String toString(int doc) {
        return description() + '=' + intVal(doc);
      }
    };
  }
}

OrdFieldSource is almost identical to ConstValueSource; the main differences are that it returns the term ordinal rather than a constant value, and that it uses StringIndexDocValues, which handles obtaining the ordering of values.

Our own ValueSource

We now have a pretty good idea what a ValueSource subclass has to do:

return some value for a given doc id.

This can be based on the value of a field in the index (like OrdFieldSource), or have nothing to do with the index at all (like ConstValueSource).

Here's one that performs the opposite of MaxFloatFunction/max() – MinFloatFunction/min():

public class MinFloatFunction extends ValueSource {
  protected final ValueSource source;
  protected final float fval;
 
  public MinFloatFunction(ValueSource source, float fval) {
    this.source = source;
    this.fval = fval;
  }
 
  public DocValues getValues(Map context, IndexReader reader) throws IOException {
    final DocValues vals =  source.getValues(context, reader);
    return new DocValues() {
      public float floatVal(int doc) {
        float v = vals.floatVal(doc);
        return v > fval ? fval : v;
      }
      public int intVal(int doc) {
        return (int)floatVal(doc);
      }
      public long longVal(int doc) {
        return (long)floatVal(doc);
      }
      public double doubleVal(int doc) {
        return (double)floatVal(doc);
      }
      public String strVal(int doc) {
        return Float.toString(floatVal(doc));
      }
      public String toString(int doc) {
        return "min(" + vals.toString(doc) + "," + fval + ")";
      }
    };
  }
 
  @Override
  public void createWeight(Map context, Searcher searcher) throws IOException {
    source.createWeight(context, searcher);
  } 
 
// boilerplate methods omitted
}

And the corresponding ValueSourceParser:

public class MinValueSourceParser extends ValueSourceParser {
  public void init(NamedList namedList) {
  }
 
  public ValueSource parse(FunctionQParser fqp) throws ParseException {
    ValueSource source = fqp.parseValueSource();
    float val = fqp.parseFloat();
    return new MinFloatFunction(source, val);
  }
}
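
To wire this up, register the parser in solrconfig.xml and call it from a query. A minimal sketch, assuming we register it under the name "min" and have a numeric field called popularity (both names are just illustrative):

<valueSourceParser name="min" class="com.mycompany.MinValueSourceParser" />

Then a query like this caps the contribution of popularity at 10:

http://localhost:8983/solr/select?q=_val_:"min(popularity,10)"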

Dynamic facet population with Solr DataImportHandler

Posted by Kelvin on 02 Aug 2010 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

Here's what I'm trying to do:

Given this mysql table:

CREATE TABLE `tag` (
    `id` INTEGER AUTO_INCREMENT NOT NULL PRIMARY KEY,
    `name` VARCHAR(100) NOT NULL UNIQUE,
    `category` VARCHAR(100)
);
INSERT INTO tag (name,category) VALUES ('good','foo');
INSERT INTO tag (name,category) VALUES ('awe-inspiring','foo');
INSERT INTO tag (name,category) VALUES ('mediocre','bar');
INSERT INTO tag (name,category) VALUES ('terrible','car');

and this solr schema

<field name="tag-foo" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="tag-bar" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="tag-car" type="string" indexed="true" stored="true" multiValued="true"/>

to populate these tag fields via DataImportHandler.

The dumb (but straightforward) way to do it is to use sub-entities, but this is terribly expensive since you use one extra SQL query per category.

Solution

My general approach was to concatenate the rows into a single row, then use RegexTransformer and a custom dataimport Transformer to split out the values.

Here's how I did it:

My dataimporthandler xml:

            <entity name="tag-facets" transformer="RegexTransformer,org.supermind.solr.TagFacetsTransformer"
                    query="select group_concat(concat(t.category,'=',t.name) separator '#') as tagfacets from tag t,booktag bt where bt.id='${book.id}' and t.category is not null">
                <field column="tagfacets" splitBy="#"/>
            </entity>

You'll see that a temporary field tagfacets is used. This will be deleted later on in TagFacetsTransformer.

package org.supermind.solr;
 
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
 
import java.util.List;
import java.util.Map;
 
public class TagFacetsTransformer extends Transformer {
  public Object transformRow(Map<String, Object> row, Context context) {
    Object tf = row.get("tagfacets");
    if (tf != null) {
      if (tf instanceof List) {
        List list = (List) tf;
        for (Object o : list) {
          String[] arr = ((String) o).split("=");
          if (arr.length == 2) row.put("tag-" + arr[0], arr[1]);
        }
      } else {
        String[] arr = ((String) tf).split("=");
        if (arr.length == 2) row.put("tag-" + arr[0], arr[1]);
      }
      row.remove("tagfacets");
    }
    return row;
  }
}

Here's what DIH's verbose output looks like (with my own data):

<str name="tagfacets">lang=ruby#framework=ruby-on-rails</str>
<str>---------------------------------------------</str>
<lst name="transformer:RegexTransformer">
<str>---------------------------------------------</str>
<arr name="tagfacets">
<str>lang=ruby</str>
<str>framework=ruby-on-rails</str>
</arr>
<str>---------------------------------------------</str>
<lst name="transformer:org.supermind.solr.TagFacetsTransformer">
<str>---------------------------------------------</str>
<str name="tag-framework">ruby-on-rails</str>
<str name="tag-lang">ruby</str>
<str>---------------------------------------------</str>
</lst>
</lst>
</lst>

You can see the step-by-step transformation of the input value.

Pretty nifty, eh?

Upgrading to Lucene 3.0

Posted by Kelvin on 28 Apr 2010 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

Recently upgraded a 3-year-old app from Lucene 2.1-dev to 3.0.1.

Some random thoughts on the evolution of the Lucene API over the past 3 years:

I miss Hits

Sigh. Hits has been deprecated for a while now, but with 3.0 it's gone. And I have to say it's a pain that it is.

Where I used to pass the Hits object around, now I need to pass TopDocs AND Searcher in order to get to documents.

Instead of

Document doc = hits.doc(i);

it's now

Document doc = searcher.doc(topdocs.scoreDocs[i].doc);

Much more verbose with zero benefit to me as a programmer.
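
For context, here's roughly what the 3.0-style search loop ends up looking like (a minimal sketch; the index path and query construction are placeholders):

IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File("/path/to/index")), true);
TopDocs topDocs = searcher.search(query, 10);
for (ScoreDoc sd : topDocs.scoreDocs) {
  Document doc = searcher.doc(sd.doc);
  // do something with doc...
}
searcher.close();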

Nice number indexing via NumericField

Where I previously had to pad numbers for lexicographic searching, there's now a proper NumericField and NumericRangeFilter.
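
A quick sketch of what that looks like (the field name "price" and the values are made up):

Document doc = new Document();
doc.add(new NumericField("price", Field.Store.YES, true).setDoubleValue(29.99));

// later, at query time:
Query q = NumericRangeQuery.newDoubleRange("price", 10.0, 50.0, true, true);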

Lockless commits

What more can I say? Yay!!

What has not changed…

Perhaps somewhat more important than what has changed is what has remained the same: 95% of the API and the query language.

3 years is a mighty long time and Lucene has experienced explosive growth during this period. The overall sanity of change is a clear sign of Lucene's committers' dedication to simplicity and a hat-tip to Doug's original architecture and vision.

Mapping neighborhoods to street addresses via geocoding

Posted by Kelvin on 19 Apr 2010 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, Ubuntu

As far as I know, none of the geocoders consistently provide neighborhood data given a street address. Useful information proves elusive even when consulting the gods at Google.

Here's a step-by-step guide to obtaining neighborhood names for your street addresses (on Ubuntu).

0. Geocode your addresses if necessary using Yahoo, MapQuest or Google geocoders. (this means converting addresses into latitude and longitude).

1. Install PostGIS.

sudo apt-get install postgresql-8.3-postgis

2. Complete the postgis install

sudo -u postgres createdb mydb
sudo -u postgres createlang plpgsql mydb
cd /usr/share/postgresql-8.3-postgis/
sudo -u postgres psql -d mydb -f lwpostgis.sql
sudo -u postgres psql -d mydb -f spatial_ref_sys.sql

3. Download and import Zillow neighborhood data. For this example, we'll be using California data.

cd /tmp
wget http://www.zillow.com/static/shp/ZillowNeighborhoods-CA.zip
unzip ZillowNeighborhoods-CA.zip
shp2pgsql ZillowNeighborhoods-CA public.neighborhood > ca.sql
sudo -u postgres psql -d mydb -f ca.sql

4. Connect to psql and run a query.

sudo -u postgres psql -d mydb
select name,city from public.neighborhood where ST_Within(makepoint(-122.4773980,37.7871760), the_geom)=true ;

If you've done everything right, this should be returned from the SQL:

      name      |     city
----------------+---------------
 Inner Richmond | San Francisco
(1 row)

Voila!

Average length of a URL

Posted by Kelvin on 06 Nov 2009 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling

Aug 16 update: I ran a more comprehensive analysis with a more complete dataset. Find out the new figures for the average length of a URL

I've always been curious what the average length of a URL is, mostly when approximating memory requirements of storing URLs in RAM.

Well, I did a dump of the DMOZ URLs, sorted and uniq-ed the list of URLs.

Ended up with 4074300 unique URLs weighing in at 139406406 bytes, which works out to roughly 34 characters per URL (139406406 / 4074300 ≈ 34.2).

Idea: 2-stage recovery of corrupt Solr/Lucene indexes

Posted by Kelvin on 09 Sep 2009 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

I was recently onsite with a client who happened to have a corrupt Solr/Lucene index. The CheckIndex tool (lucene 2.4+) diagnosed the problem, and gave the option of fixing it.

Except… fixing the index in this case meant losing the corrupt segment, which also happened to be the one containing over 90% of documents.

Because Solr has the concept of a document uid (which Lucene doesn't), I wrote a tool for them that dumps the uids in the corrupted segment to a text file. After recovering the index, they were able to reindex the docs that had been lost in that segment.
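
The actual tool worked against the corrupt segment directly, but the basic idea looks something like this (a rough sketch against the Lucene 2.4-era API; the index path and the uniqueKey field name "id" are assumptions):

// dump whatever stored uids are still readable to a text file,
// before running CheckIndex -fix, so the lost docs can be reindexed later
IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/path/to/index"));
PrintWriter out = new PrintWriter(new FileWriter("uids.txt"));
for (int i = 0; i < reader.maxDoc(); i++) {
  if (reader.isDeleted(i)) continue;
  Document doc = reader.document(i);   // stored fields only
  String uid = doc.get("id");          // Solr uniqueKey field (assumed name)
  if (uid != null) out.println(uid);
}
out.close();
reader.close();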

Using Hadoop IPC/RPC for distributed applications

Posted by Kelvin on 02 Jun 2008 | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch

Hadoop is growing to be a pretty large framework – release 0.17.0 has 483 classes!

Previously, I'd written about Hadoop SequenceFile. SequenceFile is part of the org.apache.hadoop.io package, the other notable useful classes in that package being ArrayFile and MapFile which are persistent array and dictionary data structures respectively.

About Hadoop IPC

Here, I'm going to introduce Hadoop IPC, or the Inter-Process Communication subsystem. Hadoop IPC is the underlying mechanism used whenever out-of-process applications need to communicate with one another.

Hadoop IPC
1. uses binary serialization via java.io.DataOutputStream and java.io.DataInputStream, unlike SOAP or XML-RPC.
2. is a simple, low-fuss RPC mechanism.
3. is unicast; it does not support multicast.

Why use Hadoop IPC over RMI or java.io.Serialization? Here's what Doug has to say:

Why didn't I use Serialization when we first started Hadoop? Because it looked big-and-hairy and I thought we needed something lean-and-mean, where we had precise control over exactly how objects are written and read, since that is central to Hadoop. With Serialization you can get some control, but you have to fight for it.

The logic for not using RMI was similar. Effective, high-performance inter-process communications are critical to Hadoop. I felt like we'd need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those.

Code!!!

I'm going to jump straight into some code examples to illustrate how easy Hadoop IPC can be.

Essentially, all unicast RPC calls involve a client and a server.

To create a server,

Configuration conf = new Configuration();
Server server = RPC.getServer(this, "localhost", 16000, conf);  // start a server on localhost:16000
server.start();

To create a client,

Configuration conf = new Configuration();
InetSocketAddress addr = new InetSocketAddress("localhost", 16000);  // the server's inetsocketaddress
ClientProtocol client = (ClientProtocol) RPC.waitForProxy(ClientProtocol.class, 
    ClientProtocol.versionID, addr, conf);

In this example, ClientProtocol is an interface implemented by the server class. ClientProtocol.java looks like this:

interface ClientProtocol extends org.apache.hadoop.ipc.VersionedProtocol {
  public static final long versionID = 1L;

  HeartbeatResponse heartbeat();
}

ClientProtocol defines a single method, heartbeat() which returns a HeartbeatResponse object. heartbeat() is the remote method called by the client which periodically lets the server know that it (the client) is still alive. The server then returns a HeartbeatResponse object which provides the client with some information.

Here's what HeartbeatResponse.java looks like:

public class HeartbeatResponse implements org.apache.hadoop.io.Writable {
  String status;

  public void write(DataOutput out) throws IOException {
    UTF8.writeString(out, status);
  }

  public void readFields(DataInput in) throws IOException {
    this.status = UTF8.readString(in);
  }
}
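
For completeness, here's roughly what the server side might look like: a minimal sketch assuming it lives in the same package as HeartbeatResponse (the class name and the "OK" status value are made up, and error handling is omitted):

public class HeartbeatServer implements ClientProtocol {

  public HeartbeatResponse heartbeat() {
    HeartbeatResponse response = new HeartbeatResponse();
    response.status = "OK";   // let the client know how the server is doing
    return response;
  }

  // required by VersionedProtocol
  public long getProtocolVersion(String protocol, long clientVersion) throws IOException {
    return ClientProtocol.versionID;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Server server = RPC.getServer(new HeartbeatServer(), "localhost", 16000, conf);
    server.start();
  }
}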

Summary

So, to summarize, you'll need:
1. A server which implements an interface (which itself extends VersionedProtocol)
2. One or more clients which call the interface methods
3. Any method arguments or objects returned by methods must implement org.apache.hadoop.io.Writable

Is Nutch appropriate for aggregation-type vertical search?

Posted by Kelvin on 24 Sep 2007 | Tagged as: crawling, programming, Lucene / Solr / Elasticsearch / Nutch

I get pinged all the time by people who tell me they want to build a vertical search engine with Nutch. The part I can't figure out, though, is why Nutch?

What's vertical anyway?

So let's start from basics. Vertical search engines typically fall into 2 categories:

  1. Whole-web search engines which selectively crawl the Internet for webpages related to a certain topic/industry/etc.
  2. Aggregation-type search engines which mine other websites and databases, aggregating data and repackaging it into a format which is easier to search.

Now, imagine a biotech company comes to me to develop a search engine for everything related to biotechnology and genetics. You'd have to crawl as many websites as you can, and only include the ones related to biotechnology in the search index.

How would I implement the crawler? Probably use Nutch for the crawling and modify it to only extract links from a page if the page contents are relevant to biotechnology. I'd probably need to write some kind of relevancy scoring function which uses a mixture of keywords, ontology and some kind of similarity detection based on sites we know a priori to be relevant.

Now, second scenario. Imagine someone comes to me and wants to develop a job search engine for a certain country. This would involve indexing all jobs posted in the 4 major job websites, refreshing this database on a daily basis, checking for new jobs, deleting expired jobs, etc.

How would I implement this second crawler? Use Nutch? No way! Ahhhh, now we're getting to the crux of this post…

The ubiquity of Lucene … and therefore Nutch

Nutch is one of two open-source Java crawlers out there, the other being Heritrix from the good guys at the Internet Archive. It has ridden on Lucene's position as the default choice for a full-text search API. Everyone who wants to build a vertical search engine in Java these days knows they're going to use Lucene as the search API, and naturally looks to Nutch for the crawling side of things. And that's when their project runs into a brick wall…

To Nutch or not to Nutch

Nutch (and Hadoop) is a very very cool project with ambitious and praiseworthy goals. They're really trying to build an open-source version of Google (though I'm not sure if that is actually the explicitly declared aim).

Before jumping into any library or framework, you want to be sure you know what needs to be accomplished. I think this is the step many people skip: they have no idea what crawling is all about, so they try to learn what crawling is by observing what a crawler does. Enter Nutch.

The trouble is, observing/using Nutch isn't necessarily the best way to learn about crawling. The best way to learn about crawling is to build a simple crawler.

In fact, if you sit down and think about what a 4 job-site crawler really needs to do, it's not difficult to see that its functionality is modest and humble. I can write its algorithm out right here:


for each site:
  if there is a way to list all jobs in the site, then
    page through this list, extracting job detail urls to the detail url database
  else if exists browseable categories like industry or geographical location, then
    page through these categories, extracting job detail urls to the detail url database
  else 
    continue
  for each url in the detail url database:
    download the url
    extract data into a database table according to predefined regex patterns

It won't be difficult to hack up something quick to do this, especially with the help of Commons HttpClient. You'll probably also want to make this app multi-threaded.
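
A minimal single-threaded fetch with Commons HttpClient 3.x might look something like this (a sketch; the URL and the regex extraction are placeholders):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class SimpleFetcher {
  public static void main(String[] args) throws Exception {
    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod("http://jobsite.example.com/jobs?page=1");
    try {
      int status = client.executeMethod(get);
      if (status == 200) {
        String html = get.getResponseBodyAsString();
        // apply your predefined regex patterns to html here to pull out
        // detail urls or job fields, and write them to the database
      }
    } finally {
      get.releaseConnection();
    }
  }
}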

Other things you'll want to consider are how many simultaneous threads to hit a server with, whether you want to save the HTML content of pages vs. just keeping the extracted data, how to deal with errors, etc.

All in all, I think you'll find that it's not altogether overwhelming, and there's actually a lot to be said for the complete control you have over the crawling and post-crawl extraction processes. Compare this to Nutch, where you'll need to fiddle with various configuration files (nutch-site.xml, urlfilters, etc.), calling it from an API perspective is difficult, you'll have to work with the various file I/O structures to reach the content (SegmentFile, MapFile, etc.), various issues may prevent all urls from being fetched (retry.max being a common one), and if you want custom crawl logic, you'll have to patch/fork the codebase (ugh!).

The other thing that Nutch offers is an out-of-the-box search solution, but I personally have never found a compelling reason to use it – it's difficult to add custom fields, adding OR phrase capability requires patching the codebase, etc. In fact, I find it much much simpler to come up with my own SearchServlet.

Even if you decide not to come up with a homegrown solution and want to go with Nutch, here's one other thing you need to know before jumping in.

To map-reduce, or not?

From Nutch 0.7 to Nutch 0.8, there was a pretty big jump in the code complexity with the inclusion of the map-reduce infrastructure. Map-reduce subsequently got factored out, together with some of the core distributed I/O classes into Hadoop.

For a simple example to illustrate my point, just take a look at the core crawler class, org.apache.nutch.fetcher.Fetcher, from the 0.7 branch, to the current 0.9 branch.

The 0.7 Fetcher is simple and easy to understand. I can't say the same of the 0.9 Fetcher. Even after having worked a bit with the 0.9 fetcher and map-reduce, I still find myself having to do mental gymnastics to figure out what's going on. BUT THAT'S OK, because writing massively distributable, scalable yet reliable applications is very very hard, and map-reduce makes this possible and comparatively easy. The question to ask, though, is: does your search engine project to crawl and search those 4 job sites fall into this category? If not, you'd want to seriously consider not using the latest 0.8x release of Nutch, and lean toward 0.7 instead. Of course, the biggest problem with this is that 0.7 is not being actively maintained (to my knowledge).

Conclusion

Perhaps someone will read this post and think I'm slighting Nutch, so let me make this really clear: _for what it's designed to do_, that is, whole-web crawling, Nutch does a good job of it; but if what is needed is to page through search result pages and extract data into a database, Nutch is simply overkill.

Exploring Hadoop SequenceFile

Posted by Kelvin on 03 Jan 2007 | Tagged as: Lucene / Solr / Elasticsearch / Nutch

Hadoop's SequenceFile is at the heart of the Hadoop io package. Both MapFile (disk-backed Map) and ArrayFile (disk-backed Array) are built on top of SequenceFile.

So what exactly is SequenceFile? Its class javadoc tells us: "Support for flat files of binary key/value pairs" – not very helpful.

Let's dig through the code and find out more:

  1. supports key/value pairs, where the key and value are any arbitrary classes which implement org.apache.hadoop.io.Writable
  2. contains 3 inner classes: Reader, Writer and Sorter
  3. Sorts are performed as external merge-sorts
  4. designed to be modified in batch, i.e. does NOT support appends or incremental updates. Modifications/appends involve creating a new SequenceFile, copying from the old to the new, and adding/changing values along the way
  5. compresses values using java.util.zip.Deflater after version 3
  6. from code comments: Inserts a globally unique 16-byte value every few entries, so that one can seek into the middle of a file and then synchronize with record starts and ends by scanning for this value
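
For a feel of the API, here's a minimal write/read round-trip (a sketch using the local filesystem; the path and key/value types are arbitrary):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);
Path path = new Path("/tmp/example.seq");

// write some key/value pairs
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
writer.append(new Text("foo"), new IntWritable(1));
writer.append(new Text("bar"), new IntWritable(2));
writer.close();

// read them back
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();
IntWritable value = new IntWritable();
while (reader.next(key, value)) {
  System.out.println(key + " => " + value);
}
reader.close();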

You might also be interested in a recent post on using Hadoop IPC/RPC for distributed applications.
