I'm trying to implement a simple name search using Elasticsearch, with a Java app as the client.
Schema:
{
  "firstname": {
    "type": "string",
    "analyzer": "standard"
  },
  "lastname": {
    "type": "string",
    "analyzer": "standard"
  },
  "fullname": {
    "type": "string",
    "analyzer": "standard"
  }
}
Query:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "fullname": {
              "query": "Michael Jordan",
              "type": "boolean",
              "fuzziness": 1,
              "minimum_should_match": "100%"
            }
          }
        },
        {
          "match": {
            "lastname": {
              "query": "Michael Jordan",
              "type": "boolean",
              "fuzziness": 1
            }
          }
        }
      ]
    }
  }
}
Code:
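import static org.elasticsearch.index.query.QueryBuilders.boolQuery;
import static org.elasticsearch.index.query.QueryBuilders.matchQuery;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.elasticsearch.index.query.BoolQueryBuilder;
String fullname = QueryParser.escape(queryFullname);
BoolQueryBuilder queryBuilder = boolQuery()
    .must(matchQuery("fullname", fullname).fuzziness(fuzziness).minimumShouldMatch("100%"))
    .must(matchQuery("lastname", fullname).fuzziness(fuzziness));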
Requirements:
- Use the existing mapping, which means I can't change the analyzer. That doesn't matter much, since the only thing I would add is asciifolding.
- Return results when searching with only the lastname, but not when searching with only the firstname.
- Tolerate errors in the query param. This is the reason for the fuzziness; it also compensates for the absence of asciifolding, although that narrows the margin left for other errors.
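To illustrate what I mean by the narrower margin, here is a small plain-Java sketch. It does not use Elasticsearch at all: editDistance is my own helper implementing plain Levenshtein distance, the name "jordán" is a made-up example, and I'm assuming the standard analyzer lowercases the indexed term but keeps the accent.

```java
class FuzzinessWindow {
    // Plain Levenshtein distance (insert, delete, substitute).
    // Hypothetical helper for illustration; not an Elasticsearch or Lucene API.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Assumed indexed term: lowercased, accent kept (no asciifolding).
        String indexed = "jord\u00e1n";
        // The missing accent alone already uses up the single allowed edit.
        System.out.println(editDistance("jordan", indexed)); // 1
        // Missing accent plus one real typo: outside fuzziness 1.
        System.out.println(editDistance("jordam", indexed)); // 2
    }
}
```

So with fuzziness 1, a user who types the unaccented form has no room left for an actual typo.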
Explanation:
- I'm using must on fullname and lastname only, since searching with firstname would break one of the requirements.
- I'm using a minimum should match of 100% on fullname, since I want every query term to match a document term before the document counts as a result. Fuzziness provides the margin for error.
- The reason I'm using fullname too, and not only lastname, is that I want better scores when the whole fullname matches.
- The reason I'm using lastname is best explained through an example:
Documents:
- fullname: Paul Foo Bar, firstname: Paul, lastname: Foo Bar
- fullname: Paul Alkis Desk, firstname: Paul, lastname: Alkis Desk
Query:
- Paul Bar
As is, the query returns Paul Foo Bar, which is what I want. If I query only fullname and remove minimum should match, it returns both documents (because of Paul), which I don't want. If I keep minimum should match but play with its value, the results become very inconsistent.
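To make the intended behaviour concrete, here is a rough plain-Java model of how I understand the two must clauses to act as a filter. It is only a sketch under my own assumptions: it ignores scoring entirely, tokens() is a crude stand-in for the standard analyzer (lowercase, split on whitespace), and editDistance is plain Levenshtein, which may differ from how Elasticsearch actually counts edits. All class and method names are mine.

```java
import java.util.Arrays;
import java.util.List;

class NameSearchModel {
    // Plain Levenshtein distance; hypothetical helper, not an Elasticsearch API.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    // Crude stand-in for the standard analyzer: lowercase, split on whitespace.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // True if some indexed token is within maxEdits of the query term.
    static boolean fuzzyContains(List<String> indexed, String term, int maxEdits) {
        for (String token : indexed) {
            if (editDistance(token, term) <= maxEdits) {
                return true;
            }
        }
        return false;
    }

    // Models both must clauses: every query term has to match fullname
    // (minimum should match 100%), and at least one has to match lastname.
    static boolean matches(String query, String fullname, String lastname, int maxEdits) {
        List<String> full = tokens(fullname);
        List<String> last = tokens(lastname);
        boolean allInFullname = true;
        boolean anyInLastname = false;
        for (String term : tokens(query)) {
            allInFullname &= fuzzyContains(full, term, maxEdits);
            anyInLastname |= fuzzyContains(last, term, maxEdits);
        }
        return allInFullname && anyInLastname;
    }

    public static void main(String[] args) {
        System.out.println(matches("Paul Bar", "Paul Foo Bar", "Foo Bar", 1));       // true
        System.out.println(matches("Paul Bar", "Paul Alkis Desk", "Alkis Desk", 1)); // false
        System.out.println(matches("Paul", "Paul Foo Bar", "Foo Bar", 1));           // false
        System.out.println(matches("Bar", "Paul Foo Bar", "Foo Bar", 1));            // true
    }
}
```

In this model the lastname clause is what rejects a firstname-only search, and the 100% requirement on fullname is what rejects Paul Alkis Desk.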
My question is: is this logic good enough? Is there a better (i.e. faster or more reliable) way to do what I need without changing the mapping? I haven't tested the performance yet, but can you spot any major performance pitfalls?
I'm using Elasticsearch 1.7.5.
P.S. If you find this to be utterly wrong, for whatever reasons, feel free to say so and point me to some useful reading material. I'm new to Elasticsearch.