Hadoop's SequenceFile is at the heart of the Hadoop io package. Both MapFile (disk-backed Map) and ArrayFile (disk-backed Array) are built on top of SequenceFile.

So what exactly is SequenceFile? Its class javadoc tells us: Support for flat files of binary key/value pairs.- not very helpful.

Let's dig through the code and find out more:

  1. supports key/value pair, where key and value are any arbitrary classes which implement org.apache.hadoop.io.Writable
  2. contains 3 inner classes: Reader, Writer and Sorter
  3. Sorts are performed as external merge-sorts
  4. designed to be modified in batch, i.e. does NOT support appends or incremental updates. Modifications/appends involving creating a new SequenceFile, copying from the old->new, adding/changing values along the way
  5. compresses values using java.util.zip.Deflater after version 3
  6. from code comments: Inserts a globally unique 16-byte value every few entries, so that one can seek into the middle of a file and then synchronize with record starts and ends by scanning for this value

You might also be interested in a recent post on using Hadoop IPC/RPC for distributed applications.