
We are using Nutch to crawl some sites, pushing the index to Elasticsearch, and searching through a custom UI that calls the Elasticsearch APIs.

The problem is that I need to crawl some sites but exclude them from the Elasticsearch index (for example, I need to crawl A, B and C, but exclude B from the index). We could not find a solution that we could implement at the stage where the index is pushed to Elasticsearch, so we decided to try filtering at query time instead.

The Elasticsearch index (which Nutch creates) contains a url field. This is perfect, but the problem is that Elasticsearch (as I understand it) analyzes this field as full text, so that, for example, http://www.somesite.com is split into four or more keywords (http, www, somesite, com). I cannot figure out how to build an Elasticsearch query that will, for example, exclude these URLs:

http://www.somesite.com/contact/

http://www.somesite.com/privacy/

When I run my DSL query, it seems to break the URL into pieces (http, www, somesite, com) and OR them together, which always returns all the results.

For example:

{
    "query": {
        "bool": {
            "must": { "match": { "url": "http://www.somesite.com/page1" }}
        }
    }
}

Always returns all the results.

Has anyone done something like this?

ColdAir

2 Answers

0

For a regex-based match, try the regexp query:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html

You can index url as a keyword field and run a term query against it, or prevent the field from being analyzed by setting not_analyzed in the mapping (the setting used for string fields before Elasticsearch 5.0).

Keyword tokenizer:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-tokenizer.html

Read about not_analyzed here:

https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html
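As a rough sketch of the term-query approach, the snippet below builds a search body that excludes specific URLs with a bool query and must_not clauses. It assumes the url field is indexed as a keyword (not analyzed), so that terms and prefix queries compare against the whole URL. The helper name build_exclusion_query is made up for illustration, and the body is only built and printed here, not sent to a cluster.

```python
import json


def build_exclusion_query(exact_urls, url_prefixes=(), field="url"):
    """Build a search body that drops documents whose URL matches.

    Assumes `field` is a keyword (not analyzed) field; on an analyzed
    text field these queries would compare against individual tokens.
    """
    must_not = []
    if exact_urls:
        # "terms" matches documents whose field equals any listed value.
        must_not.append({"terms": {field: list(exact_urls)}})
    for prefix in url_prefixes:
        # "prefix" matches every value that starts with the given string.
        must_not.append({"prefix": {field: prefix}})
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "must_not": must_not,
            }
        }
    }


body = build_exclusion_query(
    exact_urls=[
        "http://www.somesite.com/contact/",
        "http://www.somesite.com/privacy/",
    ]
)
print(json.dumps(body, indent=2))
```

Passing url_prefixes=["http://www.somesite.com/"] instead would exclude everything under that site, again only if the field is not analyzed.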

But if you want to filter based on domains and paths, it is better to separate the host and path at the application level and then index them separately.
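To illustrate the host/path split, here is a minimal sketch that derives separate fields from a URL before indexing. The function name split_url_fields and the field names host and path are assumptions for this example, not something Nutch or Elasticsearch defines for you.

```python
from urllib.parse import urlsplit


def split_url_fields(url):
    """Return the URL together with its host and path as separate fields."""
    parts = urlsplit(url)
    return {
        "url": url,
        # hostname is lowercased by urlsplit and excludes any port.
        "host": parts.hostname or "",
        "path": parts.path or "/",
    }


doc_fields = split_url_fields("http://www.somesite.com/contact/")
print(doc_fields)
```

With the host stored in its own non-analyzed field, excluding a whole site becomes a single exact-match filter on host rather than a pattern match on the full URL.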

xrage
  • Nutch sends the data to E.S., so I have no control over what type of fields E.S. creates. I would prefer not to modify them later, because the next time we need to re-create the index I will have to remember to apply the steps. What is the difference between term and query? – ColdAir Jan 29 '18 at 19:51
  • I think you are asking about the difference between the term and match queries. https://stackoverflow.com/a/40124408/1523811 – xrage Jan 30 '18 at 05:34
  • You can try regexp also. – xrage Jan 30 '18 at 05:35
0

You haven't specified which version of Nutch you're using. There are a couple of ways to accomplish this on the Nutch side (regardless of whether your backend is Solr or ES).

One option is to implement your own IndexingFilter (https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexingFilter.java#L55-L56). If the filter method returns null, the document is discarded (i.e. not indexed), so your implementation can apply its own logic (such as matching on the hostname of the URL) and reject the documents that you don't want to index.
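The decision logic such a filter would carry can be sketched as follows. This is Python for illustration only and is not the Nutch API: a real plugin would be a Java class implementing IndexingFilter, returning null where this sketch returns None. The names filter_document and EXCLUDED_HOSTS, and the host www.siteb.com, are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical set of hosts that should be crawled but never indexed.
EXCLUDED_HOSTS = {"www.siteb.com"}


def filter_document(doc, url, excluded_hosts=EXCLUDED_HOSTS):
    """Return the document to keep it, or None to drop it from the index."""
    host = (urlsplit(url).hostname or "").lower()
    if host in excluded_hosts:
        # Mirrors the "return null to discard the document" contract.
        return None
    return doc


kept = filter_document({"title": "A"}, "http://www.sitea.com/page1")
dropped = filter_document({"title": "B"}, "http://www.siteb.com/page1")
print(kept, dropped)
```

Because the check runs at indexing time, site B is still crawled (and its outlinks followed) but never reaches the index.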

Nutch v1.14 added a generic way of doing the same: a new IndexingFilter that supports JEXL expressions (https://commons.apache.org/proper/commons-jexl/reference/syntax.html), so you can exclude documents that match a specific condition. The advantage of this approach is that you don't need to write any code. The downside is that right now the hostname is not available for filtering, but you could match on the URL with a regular expression. Actually, if you use the index-basic filter, then the hostname should be available under the doc.host key in the JEXL context.

As for the ES side of the question, by default Nutch doesn't enforce any mapping when sending documents to ES. This means that ES looks at the content of each field and then decides the best possible mapping. In this case, I think ES is mapping the url field as a generic text field (which indeed splits the content into tokens).
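A small simulation shows why a match query on a tokenized url field returns nearly everything. The tokenizer here is a deliberate simplification (lowercase, split on anything that is not a letter or digit); the real analyzer may split URLs differently, but the effect of OR-ing shared tokens such as http is the same. The function names and the second URL are made up for the example.

```python
import re


def rough_tokens(text):
    """Lowercase and split on non-alphanumerics (an approximation only)."""
    return {t for t in re.split(r"[^0-9a-z]+", text.lower()) if t}


def or_match(query, field_value):
    """True when the query and the field share at least one token."""
    return bool(rough_tokens(query) & rough_tokens(field_value))


indexed_urls = [
    "http://www.somesite.com/contact/",
    "http://www.othersite.org/about/",
]
query = "http://www.somesite.com/page1"

matches = [u for u in indexed_urls if or_match(query, u)]
print(matches)  # both URLs match, because they share "http" and "www"
```

An exact comparison against the whole, untokenized value avoids this, which is what a keyword mapping provides.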

One solution is to manually create a mapping before sending the documents to ES (that is, before indexing from Nutch) for those fields whose type you already know, for instance the url as a string/keyword. Keep in mind that this should not cause any problem with Nutch as long as you don't set conflicting mappings. Usually, letting everything be indexed as a string/keyword first and seeing which fields are created is enough to define your own mapping.
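As a hedged sketch of such a mapping, the snippet below builds an index-creation body that declares url as a keyword. It assumes Elasticsearch 5.0 or later (where the keyword type exists); the helper name build_url_mapping and the type name "doc" are assumptions for illustration, so check which mapping type, if any, your Nutch and ES versions expect.

```python
import json


def build_url_mapping(doc_type=None):
    """Build an index-creation body that maps `url` as a keyword field."""
    properties = {"url": {"type": "keyword"}}
    if doc_type is None:
        # Typeless form, as used by Elasticsearch 7 and later.
        return {"mappings": {"properties": properties}}
    # Earlier versions nest the properties under a mapping type name.
    return {"mappings": {doc_type: {"properties": properties}}}


print(json.dumps(build_url_mapping(), indent=2))
print(json.dumps(build_url_mapping("doc"), indent=2))
```

The body would be supplied when creating the index, before the first Nutch indexing run, so that dynamic mapping never gets the chance to treat url as analyzed text.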

But if you only want to filter the content before indexing, then the other two solutions are probably the best approach.

Jorge Luis