When you search for the phrase ‘annual leave’, you’re generally more interested in documents that have those two terms next to each other than in documents that contain the two terms separated by a few pages. You want the proximity of the query terms to be taken into account when scoring documents for relevancy.
Elasticsearch (as of version 0.19.8) doesn’t have proximity boosting of this kind built in, but you can implement it by modifying your queries. Here’s how:
- Take your multi-term query and split it into shingles, that is, overlapping runs of adjacent terms. So the query “correct horse battery staple”, for example, is shingled into three separate two-term queries: “correct horse”, “horse battery” and “battery staple”.
- Wrap your original query in a boolean query, with the original query as the sole ‘must’ clause.
- Add the shingles as phrase queries in ‘should’ clauses.
Under the hood, Lucene scores a boolean query by summing the subscores of its matching clauses, so every ‘should’ clause that matches adds to the document’s score. Each instance of a shingle in a document will increase the document’s score, acting as a proximity boost. For even more boosting, you can add the original query reformulated as a phrase query to the boolean as well.
Our final query looks something like this:
{
  "bool": {
    "must": {
      "query_string": {
        "default_field": "text",
        "query": "correct horse battery staple"
      }
    },
    "should": [
      { "text_phrase": { "text": "correct horse" } },
      { "text_phrase": { "text": "horse battery" } },
      { "text_phrase": { "text": "battery staple" } },
      { "text_phrase": { "text": "correct horse battery staple" } }
    ]
  }
}
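The steps above can be sketched as a small Python helper that builds this kind of query from a raw query string. This is only an illustration: the function names `shingles` and `proximity_boosted_query` are hypothetical, the query is assumed to be split on whitespace rather than run through the field’s analyzer, and `default_field` is used as the `query_string` parameter for the field to search.

```python
import json


def shingles(query, size=2):
    """Split a whitespace-delimited query into overlapping runs of `size` terms."""
    terms = query.split()
    # Yields an empty list when the query has fewer than `size` terms.
    return [" ".join(terms[i:i + size]) for i in range(len(terms) - size + 1)]


def proximity_boosted_query(query, field="text"):
    """Wrap `query` in a bool query whose 'should' clauses are phrase shingles.

    Hypothetical helper: splits on whitespace only and does no escaping
    of query_string syntax.
    """
    terms = query.split()
    phrases = shingles(query, 2)
    # Add the whole query as one phrase for an extra boost. With two terms
    # the single shingle already is the whole query, so skip the duplicate.
    if len(terms) > 2:
        phrases.append(" ".join(terms))
    return {
        "bool": {
            # The original query stays as the sole 'must' clause.
            "must": {"query_string": {"default_field": field, "query": query}},
            # Each phrase match adds to the score without being required.
            "should": [{"text_phrase": {field: phrase}} for phrase in phrases],
        }
    }


if __name__ == "__main__":
    print(json.dumps(proximity_boosted_query("correct horse battery staple"), indent=2))
```

A single-term query ends up with an empty ‘should’ list, so it behaves like the plain original query.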
Depending on your corpus and the type of queries you use, you can also add shingles of different lengths, and/or boost the shingles. As with all things in relevancy, the actual boost numbers should be determined by experiment.
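As a rough sketch of that idea, the Python snippet below generates ‘should’ clauses for shingles of several lengths, each with its own boost. The function name `boosted_shingle_clauses` is hypothetical, the boost values are placeholders rather than recommendations, and the extended `{'query': ..., 'boost': ...}` form of `text_phrase` is an assumption you should check against the documentation for your Elasticsearch version.

```python
def boosted_shingle_clauses(query, field="text", boosts=None):
    """Build phrase 'should' clauses for shingles of several lengths.

    `boosts` maps shingle length (in terms) to a boost factor.
    Hypothetical helper: splits the query on whitespace only.
    """
    if boosts is None:
        # Placeholder values only; tune them by experiment on your corpus.
        boosts = {2: 1.0, 3: 2.0}
    terms = query.split()
    clauses = []
    for size, boost in sorted(boosts.items()):
        # Lengths longer than the query produce no shingles.
        for i in range(len(terms) - size + 1):
            phrase = " ".join(terms[i:i + size])
            # Assumed extended form: per-clause boost alongside the phrase.
            clauses.append(
                {"text_phrase": {field: {"query": phrase, "boost": boost}}}
            )
    return clauses


if __name__ == "__main__":
    for clause in boosted_shingle_clauses("correct horse battery staple"):
        print(clause)
```

The returned list can be used as the ‘should’ array of the boolean query in place of the unboosted phrase clauses.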