I'm working with the Enron dataset in Elasticsearch.
The mail bodies are split into paragraphs, which are stored as nested documents. But that is beside the point; I just want you to make sense of the query itself.
I wanted to verify that everything works as expected, so I picked an uncommon word in the corpus and queried for it. My intention was to check whether the total hit value would be correct. I was confused because I always got a value of 10000, which was far too high.
I chose the word electrons, which occurs a few times in the corpus. However, my query also matches electronic, which is contained in practically every mail in the corpus (I exaggerate).
Here's my query:
curl -X GET "localhost:9200/enron/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": {
    "includes": [ "*" ],
    "excludes": [ "body" ]
  },
  "query": {
    "nested": {
      "path": "body",
      "inner_hits": {},
      "query": {
        "constant_score": {
          "filter": {
            "match": {
              "body.content": "electrons"
            }
          }
        }
      }
    }
  }
}
'
Don't mind all the stuff around it. It looks like this because I'm only interested in the paragraphs containing the word electrons. This is already a test query for finding out what's going on under the hood. It returns the document, and only the inner document (body) with the matched term.
I suspected that the match filter was the culprit, so I changed the query in the filter to match_phrase. However, that didn't change anything.
How can I match the word electrons in a text field (inside a nested document) without matching electronic and other similar words?
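A side note on the total of 10000 mentioned above: assuming Elasticsearch 7 or later (which the total/relation response format further down suggests), hit counts are by default only tracked exactly up to 10,000. Beyond that, hits.total.value stays at 10000 and relation is reported as gte, so that number is probably a lower bound rather than an exact count. Below is a minimal sketch for requesting an exact count with the track_total_hits option. The variable COUNT_BODY and the helper count_hits are names made up for this example, and localhost:9200 is assumed as in the query above.

```shell
# Request body: ask for an exact total and skip returning the hits themselves.
COUNT_BODY='
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "nested": {
      "path": "body",
      "query": {
        "match": { "body.content": "electrons" }
      }
    }
  }
}'

# Hypothetical helper; assumes Elasticsearch is reachable on localhost:9200
# and that the index is called "enron" as in the original query.
count_hits() {
  curl -s -X GET "localhost:9200/enron/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$COUNT_BODY"
}
```

Calling count_hits should then report the exact number of matching mails in hits.total.value with relation eq, which makes it easier to tell an over-matching query apart from a capped counter.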
Edit:
The suggested term query is not recommended for text fields. Besides, it incorrectly returns 0 hits:
curl -X GET "localhost:9200/enron/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": {
    "includes": [ "*" ],
    "excludes": [ "body" ]
  },
  "query": {
    "nested": {
      "path": "body",
      "inner_hits": {},
      "query": {
        "constant_score": {
          "filter": {
            "term": {
              "body.content": "electrons"
            }
          }
        }
      }
    }
  }
}
'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
Edit2:
I think I have found the error. The analyzer was set to snowball for the text field. No wonder it didn't find an exact match for the term.
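To see why a stemming analyzer behaves like this, both words can be run through the _analyze API. The sketch below assumes the built-in snowball analyzer with its default (English) settings, which would be expected to reduce electrons and electronic to the same stem. That would explain both the over-matching match query and the term query finding nothing for the unstemmed word. ANALYZE_BODY and analyze_snowball are hypothetical names used only here.

```shell
# Request body for the _analyze API: run both words through the snowball analyzer.
ANALYZE_BODY='
{
  "analyzer": "snowball",
  "text": "electrons electronic"
}'

# Hypothetical helper; assumes Elasticsearch is reachable on localhost:9200.
analyze_snowball() {
  curl -s -X GET "localhost:9200/_analyze?pretty" \
    -H 'Content-Type: application/json' \
    -d "$ANALYZE_BODY"
}
```

Calling analyze_snowball prints one token per input word. If the two tokens are identical, the index cannot tell the two words apart. Replacing "snowball" with "standard" in the body gives the comparison case.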
I'm reindexing.
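A minimal sketch of one way such a reindex could look, assuming typeless mappings (Elasticsearch 7 or later) and the _reindex API. The target index name enron_standard, the variable names, and the helper functions are made up for illustration. Only body (nested) and body.content (text) are mapped explicitly, since those are the only fields named in the query above; everything else is left to dynamic mapping.

```shell
# Hypothetical target index name; the source index "enron" is taken from the query above.
NEW_INDEX="enron_standard"

# Mapping with the standard analyzer on the nested text field.
MAPPING_BODY='
{
  "mappings": {
    "properties": {
      "body": {
        "type": "nested",
        "properties": {
          "content": { "type": "text", "analyzer": "standard" }
        }
      }
    }
  }
}'

# Copy all documents from the old index into the new one.
REINDEX_BODY='
{
  "source": { "index": "enron" },
  "dest": { "index": "'"$NEW_INDEX"'" }
}'

# Hypothetical helpers; both assume Elasticsearch on localhost:9200.
create_index() {
  curl -s -X PUT "localhost:9200/$NEW_INDEX?pretty" \
    -H 'Content-Type: application/json' \
    -d "$MAPPING_BODY"
}

reindex() {
  curl -s -X POST "localhost:9200/_reindex?pretty" \
    -H 'Content-Type: application/json' \
    -d "$REINDEX_BODY"
}
```

Running create_index followed by reindex would fill the new index, after which the original queries can be pointed at enron_standard instead of enron.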
Edit3:
It was my fault. It works with the standard analyzer. ES also finds the correct documents using match, by the way.