
I have managed to get Apache Nutch to index a news website and pass the results off to Apache Solr.

I followed this tutorial: https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup — the only difference is that I decided to use Cassandra instead.

As a test I am trying to crawl CNN to extract the title of each article and the date it was published.

Question 1:

How do I parse data from the web page to extract the date and the title?

I have found this article about a plugin. It seems a bit outdated and I am not sure it still applies. I have also read that Tika can be used as well, but again most tutorials are quite old.

http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/

Another related SO question is this:

How to extend Nutch for article crawling. I would prefer to use Nutch only because that is what I started with, but I do not really have a strong preference.

Anything would be a great help.


1 Answer


Norconex HTTP Collector will store with your document all the metadata it can find, without restriction. That ranges from the HTTP header values obtained when downloading a page to all the tags in that HTML page.

That may well be too many fields for you. If so, you can reject the ones you do not want, or instead be explicit about the ones you want to keep by adding a "KeepOnlyTagger" to the <importer> section of your configuration:

<tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
    fields="title,pubdate,anotherone,etc"/>

You'll find how to get started quickly along with configuration options here: http://www.norconex.com/product/collector-http/configuration.html
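For context, here is a sketch of how that tagger might sit inside the <importer> section of a collector configuration. The exact nesting can vary between Norconex versions (check the configuration page linked above), and the field names below are placeholders — substitute whichever metadata fields your pages actually expose:

```xml
<importer>
    <!-- Discard every field except the ones listed here.
         "title" and "pubdate" are example names; use the field
         names your crawled pages actually produce. -->
    <tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
        fields="title,pubdate"/>
</importer>
```

After a crawl, the surviving fields are what gets passed along with each document, so whatever you keep here is what you can map into Solr.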