Proximity boosting in Elasticsearch

When you search for the term ‘annual leave’, you’re generally more interested in documents that have those two terms next to each other than in documents that contain the two terms separated by a few pages. You want the proximity of the terms to be taken into account when scoring documents for relevancy.

Elasticsearch (as of version 0.19.8) doesn’t have proximity boosting of this kind built-in, but it is possible to implement it by modifying your queries. Here’s how:

  1. Take your multi-term query, and split it into shingles.  So the query “correct horse battery staple”, for example, is shingled into three separate queries: “correct horse”, “horse battery” and “battery staple”.
  2. Wrap your original query in a boolean query, with the original query as the sole ‘must’ clause.
  3. Add the shingles as phrase queries in ‘should’ clauses.

Under the hood, Lucene scores boolean queries by summing the subscores of ‘should’ clauses.  Each instance of a shingle in a document will increase the document’s score, acting as a proximity boost.  For even more boosting, you can add the original query, reformulated as a phrase query, to the boolean query as well.
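Step 1 above can be sketched as a small helper (the class and method names here are illustrative, not part of any Elasticsearch API):

```java
import java.util.ArrayList;
import java.util.List;

// Split a query string into word n-grams ("shingles").
// Shingler is a hypothetical helper class for illustration only.
public class Shingler {
    public static List<String> shingles(String query, int size) {
        String[] terms = query.trim().split("\\s+");
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + size <= terms.length; i++) {
            StringBuilder shingle = new StringBuilder(terms[i]);
            for (int j = 1; j < size; j++) {
                shingle.append(' ').append(terms[i + j]);
            }
            result.add(shingle.toString());
        }
        return result;
    }
}
```

Calling `shingles("correct horse battery staple", 2)` yields the three bigrams “correct horse”, “horse battery” and “battery staple”, ready to be dropped into ‘should’ clauses.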

Our final query looks something like this:

    'bool' : {
      'must' : {
        'query_string' : {
          'default_field' : 'text',
          'query' : 'correct horse battery staple'
        }
      },
      'should' : [
        { 'text_phrase' : { 'text' : 'correct horse' } },
        { 'text_phrase' : { 'text' : 'horse battery' } },
        { 'text_phrase' : { 'text' : 'battery staple' } },
        { 'text_phrase' : { 'text' : 'correct horse battery staple' } }
      ]
    }

Depending on your corpus and the type of queries you use, you can also add shingles of different lengths, and/or boost the shingles. As with all things in relevancy, the actual boost numbers should be determined by experiment.
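For example, weighting the full-phrase shingle more heavily than the bigrams might look like the following. The boost values are purely illustrative, and the expanded ‘text_phrase’ syntax is a sketch that should be checked against your Elasticsearch version:

```
'should' : [
  { 'text_phrase' : { 'text' : { 'query' : 'correct horse', 'boost' : 2 } } },
  { 'text_phrase' : { 'text' : { 'query' : 'horse battery', 'boost' : 2 } } },
  { 'text_phrase' : { 'text' : { 'query' : 'battery staple', 'boost' : 2 } } },
  { 'text_phrase' : { 'text' : { 'query' : 'correct horse battery staple', 'boost' : 5 } } }
]
```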

Writing a new Lucene Codec

The just-released Lucene 4.0.0-alpha allows you to customize Lucene’s index format in any way you want, by creating a new Codec. I recently implemented one of these that stores postings data in Redis, as part of a proof-of-concept project investigating updateable fields (see this blog post for more details).

While the implementation details of a new Codec will vary wildly with what you’re trying to do (in the case above, the codec is very naive, storing postings lists as simple integer arrays in Redis, keyed by term and segment name, and is almost certainly not suitable for production use!), the process of registering and using the codec will be the same in most cases. So here’s how to do it:

Registering your Codec

Codecs are loaded through the SPI mechanism, so in order to make your codec available, you have to register it. This is done by adding a file

    META-INF/services/org.apache.lucene.codecs.Codec
to your classpath, that contains the package-qualified classname of your codec:

$ cat META-INF/services/org.apache.lucene.codecs.Codec
# List of codecs (the package name below is illustrative)
com.example.FrabjousCodec

Codecs or PostingsFormats?

A Codec implementation tells Lucene how to store postings, term vectors, docvalues, and all sorts of other beasts. In many cases, however, including the Redis-backed codec described above, you’ll only want to change the postings format for a particular field, leaving all other information to be stored as normal. Lucene allows you to register specific postings formats as well as codecs, this time in

    META-INF/services/org.apache.lucene.codecs.PostingsFormat

Using your codec

You tell an IndexWriter which codec to use as part of its IndexWriterConfig:

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, new KeywordAnalyzer());
iwc.setCodec(new FrabjousCodec());
IndexWriter writer = new IndexWriter(dir, iwc);

If you want to use different postings formats on different fields, you can create a new Codec by extending Lucene40Codec and overriding getPostingsFormatForField():

public class BandersnatchCodec extends Lucene40Codec {
    final PostingsFormat lucene40 = new Lucene40PostingsFormat();
    final PostingsFormat frumious = new FrumiousPostingsFormat();

    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
        return "frumiousfield".equals(field) ? frumious : lucene40;
    }
}

Add documents and commit as normal, and any IndexReaders (that have access to your codec implementation, of course) will be able to read the index using your Codec.

Nicer multiselect boxes with Chosen

One of the classic nightmare user requests is for a dropdown list containing several thousand entries – it’s almost impossible to make this usable without some serious hacking.

Fortunately, this is the web, so somebody else has already done the serious hacking for me. I spent a frustrating few hours writing a javascript select box extension before running across Chosen, which basically does it all for me. It works as either a jQuery or Prototype plugin, and is incredibly simple to use:

    $('#my-select-box').chosen();

It also fits in seamlessly with Bootstrap css. Another win for open source and the web!

One minor niggle is that it isn’t trivial to enable or disable the dropdown. Once you’ve changed the value of the ‘disabled’ attribute on the parent select box, you need to fire a trigger event to let chosen know about it:

    $('#my-select-box').attr('disabled', true).trigger("liszt:updated");

Debugging a running Solr server

I’ve found this useful when I’m developing plugins, or just trying to fathom how some weird corner of Solr works. You can remotely debug a Solr instance running under Jetty by adding the following to the Jetty parameters:

-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005

This tells the Jetty JVM to listen for debugger connections on port 5005 using JPDA. You can then connect to it using your IDE debugger, and add breakpoints/step through code, just as you would with any other project.
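For the stock Solr distribution, which runs under Jetty via start.jar, that means launching with something like the following (the debug flags are the only addition; everything else is the usual invocation):

```
java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 -jar start.jar
```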

Gradle and UTF-8

Gradle is lovely, it really is.  I love not having to write reams of XML to build stuff.  I love being able to define functions in my build system, and not copy and paste chunks of logic everywhere.

It has its irritations, though.  One of which is that it doesn’t seem to detect character encoding properly.

So if, like a normal, sensible person, you encode your source files in UTF-8, Gradle will not tell the Java compiler that.  And your classes will end up being mangled.

To fix, you need to drop the following incantation into your build.gradle file:

[ compileJava, compileTestJava ]*.options*.encoding = 'UTF-8'

So intuitive, I know…

Lucene MemoryIndex and term offsets

The project I’m currently working on uses a sort-of backwards search engine. Rather than indexing a whole bunch of documents and then running individual queries over that index, we have a large set of registered queries that we run over individual documents as they enter the system – a bit like the elasticsearch percolator. In our case we’re using Lucene, without any search servers on top.

Lucene contains a utility class called a MemoryIndex which is a perfect fit for this case – it holds a single document in memory, allowing you to run lots of queries over that document very efficiently (the docs talk of ~500,000 simple term queries per second on a MacBook; I’m getting ~75,000 complex SpanQueries per second on my MacBook Pro).
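The basic pattern looks something like this, using the Lucene 4.x API (the field name, document text and query here are illustrative):

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.Version;

// Load the single incoming document into an in-memory index
MemoryIndex index = new MemoryIndex();
index.addField("text", "correct horse battery staple",
               new WhitespaceAnalyzer(Version.LUCENE_40));

// search() returns a relevance score, or 0.0f if the query doesn't match,
// so each registered query can be checked with a single call
float score = index.search(new TermQuery(new Term("text", "horse")));
boolean matched = score > 0.0f;
```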

I need to be able to extract term offsets from my queries, for highlighting purposes, but unfortunately the MemoryIndex in the current version of Lucene doesn’t support that. Happily, though, Lucene is an open source project. So I opened a ticket in JIRA, and submitted a patch, which was accepted almost immediately.

My first proper open source contribution. It’s amazingly satisfying, and probably addictive.

NOT WITHIN queries in Lucene

I’ve been working on migrating a client from a legacy dtSearch platform to a new system based on Lucene, part of which involves writing a query parser to translate their existing dtSearch queries into Lucene Query objects.  dtSearch allows you to perform proximity searches – find documents with term A within X positions of term B – which can be reproduced using Lucene SpanQueries (a good introduction to span queries can be found on the Lucid Imagination blog). SpanQueries match Spans – a start position and an end position within a document. So to search for "fish" within two positions of "chips", you’d create a SpanNearQuery, passing in the terms “fish” and “chips” and a slop of 2.

You can also search for terms that are not within X positions of another term.  This too is possible to achieve with SpanQueries, with a bit of trickery.

Let’s say we have the following document:

    fish and chips is nicer than fish and jam

We want to match documents that contain the term ‘fish’, but not if it’s within two positions of the term ‘chips’ – the relevant dtSearch syntax here is "fish" NOT WITHIN/2 "chips". A query of this type should return the document above, as the second instance of the term ‘fish’ matches our criteria. We can’t just negate a normal "fish" WITHIN/2 "chips" query, as that won’t match our document. We need to somehow distinguish between tokens within a document based on their context.

Enter the SpanNotQuery. A SpanNotQuery takes two SpanQueries, and returns all documents that have instances of the first Span that do not overlap with instances of the second. The Lucid Imagination post linked above gives the example of searching for “George Bush” – say you wanted documents relating to George W Bush, but not to George H W Bush. You could create a SpanNotQuery that looked for "George" within 2 positions of "Bush", not overlapping with "H".

In our specific case, we want to find instances of “fish” that do not overlap with Spans of "fish" within/2 "chips". So to create our query, we need the following:

int distance = 2;
boolean ordered = true;
// SpanTermQuery wraps a plain Term (there is no SpanTerm class)
SpanQuery fish = new SpanTermQuery(new Term(FIELD, "fish"));
SpanQuery chips = new SpanTermQuery(new Term(FIELD, "chips"));
SpanQuery fishnearchips = new SpanNearQuery(new SpanQuery[] { fish, chips },
                                            distance, ordered);

// matches Spans of "fish" that don't overlap with "fish" within/2 "chips"
Query q = new SpanNotQuery(fish, fishnearchips);

It’s a bit verbose, but that’s Java for you.