I would like to count how many times each analyzed token occurs.

First, I tried the following:

mapping:

{
  "docs": {
    "mappings": {
      "doc": {
        "dynamic": "false",
        "properties": {
          "text": {
            "type": "string",
            "analyzer": "kuromoji"
          }
        }
      }
    }
  }
}

query:

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "word-count": {
      "terms": {
        "field": "text",
        "size": "1000"
      }
    }
  },
  "size": 0
}

After inserting my data, I queried the index and got the following result:

{
  "took": 41,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 10000,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "word-count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 36634,
      "buckets": [
        {
          "key": "はい",
          "doc_count": 4734
        },
        {
          "key": "いただく",
          "doc_count": 2440
        },
        ...
      ]
    }
  }
}

Unfortunately, the terms aggregation returns only a doc_count, which is the number of documents containing the term, not the number of times the term occurs. So I think an approximate word count could be obtained using _index['text']['TERM'].df() and _index['text']['TERM'].ttf().
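To make the difference between the two numbers concrete, here is a small Python sketch that is independent of Elasticsearch. The token lists are made-up sample data (not real kuromoji output), and the function names are hypothetical.

```python
from collections import Counter

def doc_counts(tokenized_docs):
    """Number of documents that contain each token (what doc_count reports)."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(set(tokens))  # each document contributes at most 1 per token
    return counts

def word_counts(tokenized_docs):
    """Total number of occurrences of each token across all documents."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)  # repeated tokens within one document all count
    return counts

# Hypothetical, hand-tokenized sample documents.
docs = [
    ["はい", "はい", "です"],
    ["はい", "ね"],
    ["です", "ね", "ね"],
]

print(doc_counts(docs)["はい"])   # documents containing "はい"
print(word_counts(docs)["はい"])  # occurrences of "はい"
```

In this sample, "はい" appears in two documents but occurs three times, which is the gap that doc_count hides.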

Perhaps the approximate word count is given by the following equation:

WordCount = doc_count['TERM'] / _index['text']['TERM'].df() * _index['text']['TERM'].ttf()
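As a rough illustration of this estimate, the following Python sketch evaluates the equation for a single term. The function name and the input numbers are hypothetical; they are not taken from the index above.

```python
def estimate_word_count(doc_count, df, ttf):
    """Scale the index-wide total term frequency (ttf) by the share of
    documents containing the term that matched the query (doc_count / df)."""
    if df == 0:
        return 0.0  # the term does not occur in the index at all
    return doc_count / df * ttf

# Hypothetical numbers: the query matched 50 of the 100 documents that
# contain the term, and the term occurs 300 times in the whole index.
print(estimate_word_count(doc_count=50, df=100, ttf=300))  # 150.0
```

Whenever doc_count equals df, the estimate is simply ttf; otherwise it assumes the term is spread evenly over the documents that contain it.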

'TERM' is the key of each bucket. I tried to write a scripted metric aggregation, but I don't know how to access the bucket keys from the script.

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "doc-count": {
      "terms": {
        "field": "text",
        "size": "1000"
      },
      "aggs": {
        "word-count": {
          "scripted_metric": {
             // ???
          }
        }
      }
    }
  },
  "size": 0
}

How can I get the bucket keys? If that is impossible, how else can I get a count of analyzed words?

2 Answers

You can try the token_count data type. Simply add a sub-field of that type to your text field:

{
  "docs": {
    "mappings": {
      "doc": {
        "dynamic": "false",
        "properties": {
          "text": {
            "type": "string",
            "analyzer": "kuromoji",
            "fields": {
              "nb_tokens": {
                "type": "token_count",
                "analyzer": "kuromoji"
              }
            }
          }
        }
      }
    }
  }
}

Then you can use text.nb_tokens in your aggregation.
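For illustration, a request body that sums this sub-field over all matching documents might look like the following, written as a Python dictionary so it can be serialized with json.dumps. The aggregation name total_tokens is a hypothetical choice; sum is the standard metric aggregation. This is only a sketch, and it yields the total number of tokens, not a count per distinct token.

```python
import json

# Request body for a search that returns no hits, only the aggregation.
body = {
    "query": {"match_all": {}},
    "size": 0,
    "aggs": {
        # "total_tokens" is a hypothetical name chosen for this example.
        "total_tokens": {
            "sum": {"field": "text.nb_tokens"}
        }
    },
}

print(json.dumps(body, indent=2))
```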

Val
  • Thank you! I tried token_count, but it does not fit my case. It counts the tokens in each document, whereas I want to count occurrences of the same token across all documents. For example, the text "だからですね" is counted as 3 tokens, but I want "だから", "です", and "ね" to be counted as 1 token each, and then aggregated over all documents to get a total for each token. – Masao Kinoshita Mar 16 '16 at 05:46
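One possible way to get the per-token totals described in this comment is to add them up on the client. The sketch below assumes the term vectors for each document have already been fetched (for example through Elasticsearch's term vectors API) and that each response has the shape term_vectors → field → terms → term → term_freq. The function name and the sample responses are hypothetical, and fetching term vectors for 10,000 documents may be slow, so treat this as an illustration rather than a recommendation.

```python
from collections import Counter

def sum_term_freqs(termvector_responses, field="text"):
    """Add up term_freq for every term across term-vector responses."""
    totals = Counter()
    for response in termvector_responses:
        # Documents without the field have no entry under "term_vectors".
        terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats["term_freq"]
    return totals

# Hypothetical responses, trimmed to the keys this function reads.
responses = [
    {"term_vectors": {"text": {"terms": {
        "だから": {"term_freq": 1}, "です": {"term_freq": 1}, "ね": {"term_freq": 1}}}}},
    {"term_vectors": {"text": {"terms": {
        "です": {"term_freq": 2}, "ね": {"term_freq": 1}}}}},
    {"term_vectors": {}},  # a document with no "text" field
]

totals = sum_term_freqs(responses)
print(totals.most_common())
```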

You could try dynamic scripting, though this will affect performance:

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "word-count": {
      "terms": {
        "script": "_source.text",
        "size": "1000"
      }
    }
  },
  "size": 0
}

Richa