Gradle and UTF-8

Gradle is lovely, it really is.  I love not having to write reams of XML to build stuff.  I love being able to define functions in my build system, and not copy and paste chunks of logic everywhere.

It has its irritations, though.  One of which is that it doesn’t seem to detect character encoding properly.

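So if, like a normal, sensible person, you encode your source files in UTF-8, Gradle will not tell the Java compiler that.  The compiler falls back to the platform default encoding, and if that isn’t UTF-8, any non-ASCII characters in your source (string literals, for instance) end up mangled in the compiled classes.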
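To see what the mangling looks like, here is a minimal sketch of the same mismatch in plain Java: text is encoded as UTF-8 and then decoded with a single-byte charset, which is roughly what happens when the compiler reads a UTF-8 file using a different default. The class and method names are made up for illustration, and ISO-8859-1 is only standing in for whatever the platform default happens to be.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Gradle or javac.
final class EncodingMismatch {

    // Encode the text as UTF-8, then decode those bytes as ISO-8859-1,
    // mimicking a reader that assumes the wrong charset for a UTF-8 file.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }
}
```

Calling EncodingMismatch.misread("caf\u00e9") gives back a five-character string: the single accented letter has turned into two unrelated characters, because its two UTF-8 bytes were each read as a character of their own.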

To fix, you need to drop the following incantation into your build.gradle file:

[ compileJava, compileTestJava ]*.options*.encoding = 'UTF-8'

So intuitive, I know…

Lucene MemoryIndex and term offsets

The project I’m currently working on uses a sort-of backwards search engine. Rather than indexing a whole bunch of documents and then running individual queries over that index, we have a large set of registered queries that we run over individual documents as they enter the system – a bit like the elasticsearch percolator. In our case we’re using Lucene, without any search servers on top.
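As a rough illustration of that backwards flow, here is a sketch in plain Java with no Lucene involved. A Predicate over the document text stands in for a real query, and the class and method names are invented for the example, so treat it as a picture of the shape of the system rather than how ours is built.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: queries are stored up front, documents stream past them.
final class ReverseSearch {

    // Registered queries keyed by id; a Predicate stands in for a real Lucene Query.
    private final Map<String, Predicate<String>> queries = new LinkedHashMap<>();

    void register(String id, Predicate<String> query) {
        queries.put(id, query);
    }

    // Run every registered query over one incoming document and return
    // the ids of the queries that matched, in registration order.
    List<String> match(String document) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> entry : queries.entrySet()) {
            if (entry.getValue().test(document)) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }
}
```

The point is the inversion: the set of queries is the long-lived thing, and each document is only held long enough to be checked against all of them.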

Lucene contains a utility class called a MemoryIndex which is a perfect fit for this case – it holds a single document in memory, allowing you to run lots of queries over that document very efficiently (the docs talk of ~500,000 simple term queries per second on a MacBook; I’m getting ~75,000 complex SpanQueries per second on my MacBook Pro).

I need to be able to extract term offsets from my queries, for highlighting purposes, but unfortunately the MemoryIndex in the current version of Lucene doesn’t support that. Happily, though, Lucene is an open source project. So I opened a ticket in JIRA, and submitted a patch, which was accepted almost immediately.

My first proper open source contribution. It’s amazingly satisfying, and probably addictive.

NOT WITHIN queries in Lucene

I’ve been working on migrating a client from a legacy dtSearch platform to a new system based on Lucene, part of which involves writing a query parser to translate their existing dtSearch queries into Lucene Query objects.  dtSearch allows you to perform proximity searches – find documents with term A within X positions of term B – which can be reproduced using Lucene SpanQueries (a good introduction to span queries can be found on the Lucid Imagination blog). SpanQueries search for Spans – ranges of token positions in a document, each with a start and an end. A SpanNearQuery matches spans in which its clauses occur close together, where "close" is set by the slop: the number of positions allowed between them. So to search for "fish" within two positions of "chips", you’d create a SpanNearQuery, passing in the terms "fish" and "chips" and a slop of 2.

You can also search for terms that are not within X positions of another term.  This too is possible to achieve with SpanQueries, with a bit of trickery.

Let’s say we have the following document:

    fish and chips is nicer than fish and jam

We want to match documents that contain the term ‘fish’, but not if it’s within two positions of the term ‘chips’ – the relevant dtSearch syntax here is "fish" NOT WITHIN/2 "chips". A query of this type should return the document above, as the second instance of the term ‘fish’ matches our criteria. We can’t just negate a normal "fish" WITHIN/2 "chips" query, as that won’t match our document. We need to somehow distinguish between tokens within a document based on their context.
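To pin down what we are asking for, here is a small plain-Java sketch of the NOT WITHIN check on a whitespace-tokenised string. It has nothing to do with how Lucene or dtSearch implement it; the class and method names are made up, and it treats "within 2" as a position difference of at most 2, in either direction, which is my assumption about how the distance is counted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that spells out NOT WITHIN on whitespace-separated tokens.
final class NotWithin {

    // Returns the positions of every occurrence of `term` that has no occurrence
    // of `other` within `distance` positions on either side of it.
    static List<Integer> positions(String text, String term, String other, int distance) {
        String[] tokens = text.split("\\s+");
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(term)) {
                continue;
            }
            // Scan the window around this occurrence for the excluded term.
            boolean tooClose = false;
            int from = Math.max(0, i - distance);
            int to = Math.min(tokens.length - 1, i + distance);
            for (int j = from; j <= to; j++) {
                if (tokens[j].equals(other)) {
                    tooClose = true;
                    break;
                }
            }
            if (!tooClose) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

For the document above, NotWithin.positions(doc, "fish", "chips", 2) keeps only the second "fish" (position 6, counting from zero), since the first one has "chips" two positions away. A non-empty result means the document matches.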

Enter the SpanNotQuery. A SpanNotQuery takes two SpanQueries, and returns all documents that have instances of the first Span that do not overlap with instances of the second. The Lucid Imagination post linked above gives the example of searching for “George Bush” – say you wanted documents relating to George W Bush, but not to George H W Bush. You could create a SpanNotQuery that looked for "George" within 2 positions of "Bush", not overlapping with "H".

In our specific case, we want to find instances of “fish” that do not overlap with Spans of "fish" within/2 "chips". So to create our query, we need the following:

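import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.*; // org.apache.lucene.queries.spans in Lucene 9+

int distance = 2;        // slop: positions allowed between the two terms
boolean ordered = true;  // true only matches "fish" before "chips"; use false for either order
// FIELD is the name of the indexed field being searched
SpanQuery fish = new SpanTermQuery(new Term(FIELD, "fish"));
SpanQuery chips = new SpanTermQuery(new Term(FIELD, "chips"));
// Spans where "fish" is followed by "chips" within the allowed distance
SpanQuery fishnearchips = new SpanNearQuery(new SpanQuery[] { fish, chips },
                                                distance, ordered);

// Instances of "fish" that do not overlap any of those spans
Query q = new SpanNotQuery(fish, fishnearchips);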

It’s a bit verbose, but that’s Java for you.