We are using Nutch to crawl some sites, pushing the index to Elasticsearch, and using a custom UI that searches by calling the Elasticsearch APIs.
The problem is that I need to crawl some sites but exclude them from the Elasticsearch index (for example, I need to crawl A, B, and C, but exclude B from the index). We could not find a way to do this at the stage where the index is pushed to Elasticsearch, so we decided to try filtering at query time instead.
The Elasticsearch index (which Nutch creates) contains a URL field. This is perfect, but the problem is that Elasticsearch (as I understand it) analyzes this field as full text, so that, for example, http://www.somesite.com is actually split into four or more tokens (http, www, somesite, com). I cannot figure out how to build an Elasticsearch query that will, for example, exclude these URLs:
http://www.somesite.com/contact/
http://www.somesite.com/privacy/
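For reference, I believe the tokens the field actually produces can be inspected with the _analyze API. This is a minimal sketch; the index name my_nutch_index is just a placeholder for whatever index Nutch writes to:

GET /my_nutch_index/_analyze
{
  "field": "url",
  "text": "http://www.somesite.com/contact/"
}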
When I run my DSL query, it seems to break the URL up into those pieces (http, www, somesite, com) and OR them together, which always returns all the results.
For example:
{
  "query": {
    "bool": {
      "must": { "match": { "url": "http://www.somesite.com/page1" } }
    }
  }
}
This always returns all the results.
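If I understand the analysis correctly, the match query above is effectively rewritten into something like the following boolean OR of the individual tokens (my own sketch of what I think happens, not a query I actually run):

{
  "query": {
    "bool": {
      "should": [
        { "term": { "url": "http" } },
        { "term": { "url": "www" } },
        { "term": { "url": "somesite" } },
        { "term": { "url": "com" } },
        { "term": { "url": "page1" } }
      ]
    }
  }
}

Since almost every crawled document contains at least one of those tokens, that would explain why everything matches.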
Has anyone done something like this?
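For completeness, the shape of the exclusion I am trying to build is roughly the following. Note that url.keyword is an assumption on my part (a not-analyzed keyword sub-field of the url field); I do not know whether the mapping Nutch creates actually defines one:

{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "must_not": [
        { "prefix": { "url.keyword": "http://www.somesite.com/contact/" } },
        { "prefix": { "url.keyword": "http://www.somesite.com/privacy/" } }
      ]
    }
  }
}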