
I'm working with the Enron dataset in Elasticsearch. The mail bodies are split into paragraphs, which are stored as nested documents. But that is beside the point; I just want you to make sense of the query itself.
I wanted to verify that everything works as expected, so I picked an uncommon word in the corpus and queried for it. My intention was to check whether the total hit value would be correct. I was confused because I always got a value of 10000, which was far too high.
I chose the word electrons, which occurs a few times in the corpus. However, my query also matches electronic, which is contained in practically every mail in the corpus (I exaggerate).
Here's my query:

curl -X GET "localhost:9200/enron/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": {
    "includes": [ "*" ],
    "excludes": [ "body" ]
  },
  "query": {
    "nested": {
      "path": "body",
      "inner_hits": {},
      "query": {
        "constant_score": {
          "filter": {
            "match": {
              "body.content": "electrons"
            }
          }
        }
      }
    }
  }
}
'

Don't mind all the stuff around it. It looks like this because I'm only interested in the paragraphs containing the word electrons. This is already a test query for finding out what's going on under the hood. It returns the document and only the inner documents (body) that contain the matched term.
I suspected that the match filter was the culprit. So I changed the query in the filter to match_phrase. However, that didn't change anything.
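A side note on the value 10000 itself: as far as I understand, Elasticsearch 7.x stops counting at 10,000 hits by default and then reports "relation": "gte" rather than "eq", so the number is a lower bound and not a real total. Setting track_total_hits to true in the request body should give an exact count. Below is a minimal Python sketch that builds the same request body with that flag. The helper name nested_match_body is made up for illustration; the field names are the ones from my query.

```python
import json


def nested_match_body(word, track_total_hits=True):
    """Build the nested search body from the question (hypothetical helper)."""
    return {
        # Assumption: Elasticsearch 7.x, where the default hit counting
        # stops at 10,000 unless this flag is set to true.
        "track_total_hits": track_total_hits,
        "_source": {"includes": ["*"], "excludes": ["body"]},
        "query": {
            "nested": {
                "path": "body",
                "inner_hits": {},
                "query": {
                    "constant_score": {
                        "filter": {"match": {"body.content": word}}
                    }
                },
            }
        },
    }


# Print the JSON that would go into the -d argument of the curl call.
print(json.dumps(nested_match_body("electrons"), indent=2))
```

With the flag set, hits.total.relation should come back as "eq" and hits.total.value as the real number of matching mails.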

How can I match the word electrons in a text field (inside a nested document) without matching electronic and other similar words?

Edit:
The suggested term query is not recommended for text fields. Besides, it returns 0 hits, which is wrong:

curl -X GET "localhost:9200/enron/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": {
    "includes": [ "*" ],
    "excludes": [ "body" ]
  },
  "query": {
    "nested": {
      "path": "body",
      "inner_hits": {},
      "query": {
        "constant_score": {
          "filter": {
            "term": {
              "body.content": "electrons"
            }
          }
        }
      }
    }
  }
}
'
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Edit2:
I think I have found the error. The analyzer for the text field was set to snowball. No wonder it didn't find an exact match for the term.
I'm reindexing.
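To see what an analyzer does to a word, the _analyze API can be asked for the tokens it produces. The following is only a sketch: the helper names analyze_request and tokens are hypothetical, the host is assumed to be localhost:9200 as in my query, and nothing is sent until the prepared request is passed to urllib.request.urlopen. If snowball reduces both electrons and electronic to the same token, that would explain both the over-matching and the 0 hits from the term query.

```python
import json
import urllib.request


def analyze_request(analyzer, text, host="http://localhost:9200"):
    """Prepare (but do not send) a POST to the _analyze API."""
    payload = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens(response_body):
    """Extract the token strings from an _analyze JSON response."""
    return [t["token"] for t in json.loads(response_body)["tokens"]]


def compare_analyzers(text, analyzers=("snowball", "standard")):
    """Send the text through each analyzer; needs a running cluster."""
    result = {}
    for analyzer in analyzers:
        with urllib.request.urlopen(analyze_request(analyzer, text)) as resp:
            result[analyzer] = tokens(resp.read())
    return result
```

Calling compare_analyzers("electrons electronic") against a running node returns one token list per analyzer, which makes the difference between snowball and standard visible directly.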

Edit3:
It was my fault. It works with the standard analyzer. By the way, ES also finds the correct documents using match.
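For anyone repeating this, here is a rough sketch of the kind of mapping and reindex request involved, assuming Elasticsearch 7.x (typeless mappings) and the field layout from my query: body as a nested field, body.content as text. The helper names and the target index name enron_v2 are placeholders for illustration only.

```python
import json


def nested_body_mapping(analyzer="standard"):
    """Index-creation body with a nested 'body' field (assumed layout)."""
    return {
        "mappings": {
            "properties": {
                "body": {
                    "type": "nested",
                    "properties": {
                        # The analyzer set here decides which tokens get indexed.
                        "content": {"type": "text", "analyzer": analyzer}
                    },
                }
            }
        }
    }


def reindex_body(source_index, dest_index):
    """Request body for POST _reindex, copying documents between indices."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


# Body for: PUT enron_v2 (placeholder index name)
print(json.dumps(nested_body_mapping(), indent=2))
# Body for: POST _reindex
print(json.dumps(reindex_body("enron", "enron_v2"), indent=2))
```

The analyzer of an existing text field cannot simply be swapped in place, which is why a new index plus a reindex is the usual route.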

Angus

1 Answer


Use a term filter instead of match to find the exact word; see the explanation below:

What is the difference between a term query and a match one?
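A toy illustration of that difference, in plain Python and not Elasticsearch code: a match query runs the search text through the field's analyzer before looking it up, while a term query looks up the raw string as-is. The suffix-stripping toy_analyze below is a crude stand-in for a stemming analyzer and is an assumption for demonstration only; it is not the Snowball algorithm.

```python
def toy_analyze(text):
    """Lowercase, split on whitespace, strip a few suffixes (toy stemmer)."""
    result = []
    for word in text.lower().split():
        for suffix in ("ic", "s"):
            if word.endswith(suffix):
                word = word[: -len(suffix)]
                break
        result.append(word)
    return result


def build_index(docs):
    """Inverted index: analyzed token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in toy_analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, term):
    """Look up the raw term without analyzing it."""
    return index.get(term, set())


def match_query(index, text):
    """Analyze the search text first, then look up each token (OR semantics)."""
    hits = set()
    for token in toy_analyze(text):
        hits |= index.get(token, set())
    return hits


docs = {1: "some electrons here", 2: "electronic mail"}
index = build_index(docs)
print(term_query(index, "electrons"))   # raw term is not in the index
print(match_query(index, "electrons"))  # stemmed token hits both documents
```

So on a field with a stemming analyzer, a term filter for the unstemmed word can find nothing, while match finds every document that shares the stem.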

Developer
  • Sorry, I forgot to mention it, but I already tried the Term query. It returns 0 hits. It is also not recommended for text fields according to the documentation. – Angus Dec 10 '19 at 08:10