NOT WITHIN queries in Lucene

I’ve been working on migrating a client from a legacy dtSearch platform to a new system based on Lucene, part of which involves writing a query parser to translate their existing dtSearch queries into Lucene Query objects. dtSearch allows you to perform proximity searches – find documents with term A within X positions of term B – which can be reproduced using Lucene SpanQueries (a good introduction to span queries can be found on the Lucid Imagination blog). SpanQueries match Spans: ranges of token positions within a document, each with a start and an end position. A SpanNearQuery combines other SpanQueries with a "slop", the maximum number of unmatched positions allowed between their matches. So to search for "fish" within two positions of "chips", you’d create a SpanNearQuery, passing in SpanTermQueries for "fish" and "chips" and a slop of 2.

dtSearch also lets you search for terms that are not within X positions of another term. This too can be reproduced with SpanQueries, with a bit of trickery.

Let’s say we have the following document:

    fish and chips is nicer than fish and jam

We want to match documents that contain the term ‘fish’, but not if it’s within two positions of the term ‘chips’ – the relevant dtSearch syntax here is "fish" NOT WITHIN/2 "chips". A query of this type should return the document above, as the second instance of the term ‘fish’ matches our criteria. We can’t just negate a normal "fish" WITHIN/2 "chips" query, as that would exclude our document entirely: its first ‘fish’ is within two positions of ‘chips’, so the un-negated query matches it. We need to somehow distinguish between tokens within a document based on their context.
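To make the difference concrete, here is a minimal plain-Java sketch that works on token positions directly. It does not use Lucene at all: the class `NotWithinSketch` and its methods are made up for illustration, it tokenises on whitespace, and it reads "within two positions" as a position difference of at most 2, which is an assumption rather than dtSearch's or Lucene's exact definition.

```java
import java.util.ArrayList;
import java.util.List;

public class NotWithinSketch {

    // Positions at which `term` occurs in a whitespace-tokenised text.
    static List<Integer> positions(String text, String term) {
        String[] tokens = text.split("\\s+");
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // True if any occurrence of `a` lies within `distance` positions of any `b`.
    static boolean within(String text, String a, String b, int distance) {
        for (int pa : positions(text, a)) {
            for (int pb : positions(text, b)) {
                if (Math.abs(pa - pb) <= distance) {
                    return true;
                }
            }
        }
        return false;
    }

    // Positions of `a` that have no `b` within `distance` positions.
    static List<Integer> notWithin(String text, String a, String b, int distance) {
        List<Integer> bPositions = positions(text, b);
        List<Integer> result = new ArrayList<>();
        for (int pa : positions(text, a)) {
            boolean near = false;
            for (int pb : bPositions) {
                if (Math.abs(pa - pb) <= distance) {
                    near = true;
                    break;
                }
            }
            if (!near) {
                result.add(pa);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String doc = "fish and chips is nicer than fish and jam";
        // Document-level negation: the document contains a near match, so it is rejected.
        System.out.println(!within(doc, "fish", "chips", 2));   // false
        // Occurrence-level check: the second "fish" (position 6) survives.
        System.out.println(notWithin(doc, "fish", "chips", 2)); // [6]
    }
}
```

Negating the whole-document test throws the document away, while checking each occurrence separately keeps it because of the "fish" at position 6. That per-occurrence view is what the span-based approach has to reproduce.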

Enter the SpanNotQuery. A SpanNotQuery takes two SpanQueries, and matches documents that have Spans from the first query that do not overlap with any Span from the second. The Lucid Imagination post linked above gives the example of searching for "George Bush" – say you wanted documents relating to George W Bush, but not to George H W Bush. You could create a SpanNotQuery that looked for "George" within 2 positions of "Bush", excluding any match that overlaps with "H".

In our specific case, we want to find instances of “fish” that do not overlap with Spans of "fish" within/2 "chips". So to create our query, we need the following:

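    // Span classes are in org.apache.lucene.search.spans before Lucene 9.0
    // and in org.apache.lucene.queries.spans from 9.0 onwards.
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.spans.*;

    int distance = 2;        // slop passed to the SpanNearQuery
    boolean ordered = true;  // "fish" must precede "chips"; false matches either order
    // FIELD is the name of the indexed field being searched
    SpanQuery fish = new SpanTermQuery(new Term(FIELD, "fish"));
    SpanQuery chips = new SpanTermQuery(new Term(FIELD, "chips"));
    SpanQuery fishnearchips = new SpanNearQuery(new SpanQuery[] { fish, chips },
                                                distance, ordered);

    // Keep "fish" spans that do not overlap any "fish" near "chips" span
    Query q = new SpanNotQuery(fish, fishnearchips);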

It’s a bit verbose, but that’s Java for you.
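To check the logic without building an index, the query can be traced by hand with a toy model of spans. The sketch below is plain Java, not Lucene: `SpanNotSketch`, `Span`, `termSpans`, `orderedNear` and `not` are hypothetical names standing in for the roles of SpanTermQuery, SpanNearQuery and SpanNotQuery. It assumes whitespace tokenisation, treats a span as a half-open range of positions, and enumerates every pair of terms, which is simpler than what Lucene actually does internally.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanNotSketch {

    // A span covers token positions [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    // One single-position span per occurrence of the term (the SpanTermQuery role).
    static List<Span> termSpans(String[] tokens, String term) {
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) {
                spans.add(new Span(i, i + 1));
            }
        }
        return spans;
    }

    // Ordered near spans: a span from `first` followed by one from `second`
    // with at most `slop` positions in between (the SpanNearQuery role).
    static List<Span> orderedNear(List<Span> first, List<Span> second, int slop) {
        List<Span> spans = new ArrayList<>();
        for (Span a : first) {
            for (Span b : second) {
                int gap = b.start - a.end;
                if (gap >= 0 && gap <= slop) {
                    spans.add(new Span(a.start, b.end));
                }
            }
        }
        return spans;
    }

    // Spans from `include` that overlap no span in `exclude` (the SpanNotQuery role).
    static List<Span> not(List<Span> include, List<Span> exclude) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : include) {
            boolean clash = false;
            for (Span e : exclude) {
                if (candidate.overlaps(e)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = "fish and chips is nicer than fish and jam".split("\\s+");
        List<Span> fish = termSpans(tokens, "fish");
        List<Span> chips = termSpans(tokens, "chips");
        List<Span> fishNearChips = orderedNear(fish, chips, 2);
        System.out.println(fishNearChips);            // [[0,3)]
        System.out.println(not(fish, fishNearChips)); // [[6,7)]
    }
}
```

The first "fish" sits inside the "fish and chips" span and is dropped; the second, at position 6, overlaps nothing and is the span that makes the document match.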