Writing a new Lucene Codec

The just-released Lucene 4.0.0-alpha allows you to customize Lucene’s index format in any way you want, by creating a new Codec. I recently implemented one of these that stores postings data in Redis, as part of a proof-of-concept project investigating updateable fields (see this blog post for more details).

The implementation details of a new Codec will vary wildly with what you’re trying to do. In the case above, the codec is very naive: it stores postings lists as simple integer arrays in Redis, keyed by term and segment name, and is almost certainly not suitable for production use! The process of registering and using a codec, however, will be the same in most cases. So here’s how to do it:

Registering your Codec

Codecs are loaded through Java’s SPI (service provider interface) mechanism, so to make your codec available you have to register it. This is done by adding a file

META-INF/services/org.apache.lucene.codecs.Codec

to your classpath, containing the fully qualified class name of your codec:

$ cat META-INF/services/org.apache.lucene.codecs.Codec
# List of codecs
org.foo.lucene.FrabjousCodec
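
To see what such a registration file does, here is a minimal sketch of an SPI lookup that uses only java.util.ServiceLoader from the JDK. The Format and FrabjousFormat classes are hypothetical stand-ins for Lucene's Codec and your own implementation, not Lucene APIs. The sketch writes a META-INF/services file into a temporary directory, puts that directory on a class loader's search path, and lists the providers it finds. Lucene's own lookup code may differ in detail; the point is only to illustrate how a file on the classpath maps a service type to its implementations.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class SpiRegistrationSketch {

    /** Hypothetical stand-in for org.apache.lucene.codecs.Codec. */
    public abstract static class Format {
        public abstract String getName();
    }

    /** Hypothetical provider. SPI providers need a public no-arg constructor. */
    public static class FrabjousFormat extends Format {
        public FrabjousFormat() {}

        @Override
        public String getName() {
            return "Frabjous";
        }
    }

    /** Registers FrabjousFormat in a temporary directory and returns the names found via SPI. */
    static List<String> discover() {
        try {
            Path root = Files.createTempDirectory("spi-sketch");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));

            // The file is named after the service type; each non-comment line names a provider.
            String registration = "# List of formats\n" + FrabjousFormat.class.getName() + "\n";
            Files.write(services.resolve(Format.class.getName()),
                    registration.getBytes(StandardCharsets.UTF_8));

            // Put the temporary directory on the search path of a child class loader.
            URL[] searchPath = { root.toUri().toURL() };
            try (URLClassLoader loader =
                    new URLClassLoader(searchPath, SpiRegistrationSketch.class.getClassLoader())) {
                List<String> names = new ArrayList<>();
                for (Format format : ServiceLoader.load(Format.class, loader)) {
                    names.add(format.getName());
                }
                return names;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover()); // prints [Frabjous]
    }
}
```

If the registration file is missing or names the wrong class, the loop simply finds nothing (or throws a ServiceConfigurationError). That is worth remembering when a codec mysteriously fails to load.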

Codecs or PostingsFormats?

A codec implementation tells Lucene how to store postings, term vectors, doc values, and all sorts of other beasts. In many cases, however, including the Redis-backed codec described above, you’ll only want to change the postings format for a particular field, leaving all other information to be stored as normal. Lucene allows you to register specific postings formats as well as codecs, this time in

META-INF/services/org.apache.lucene.codecs.PostingsFormat
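
By analogy with the Codec registration file shown earlier, the contents would look something like the following. Both names are placeholders: the org.foo.lucene package is carried over from the codec example, and FrumiousPostingsFormat is the illustrative postings format used later in this post.

```text
$ cat META-INF/services/org.apache.lucene.codecs.PostingsFormat
# List of postings formats
org.foo.lucene.FrumiousPostingsFormat
```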

Using your Codec

You tell an IndexWriter which codec to use as part of its IndexWriterConfig:

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, new KeywordAnalyzer());
iwc.setCodec(new FrabjousCodec());
IndexWriter writer = new IndexWriter(dir, iwc);

If you want to use different postings formats on different fields, you can create a new Codec by extending Lucene40Codec and overriding getPostingsFormatForField():

public class BandersnatchCodec extends Lucene40Codec {
    final PostingsFormat lucene40 = new Lucene40PostingsFormat();
    final PostingsFormat frumious = new FrumiousPostingsFormat();

    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
        return "frumiousfield".equals(field) ? frumious : lucene40;
    }
}

Add documents and commit as normal, and any IndexReaders (that have access to your codec implementation, of course) will be able to read the index using your Codec.
