I'm working with SOLR on a project where we import a bunch (~40k items) of rich documents, mainly MS Word, Powerpoint, Excel and PDFs.

Is there a best practice schema.xml and/or solrconfig.xml to use in SOLR when using the ExtractingRequestHandler?

I have been doing tweaks to the default schema to attempt to get facets working on date modification times, but even without that, I figure there could very well exist a good example of how these files should be when the default output from Tika is enough.

If there is no such thing as a best-practice schema.xml and/or solrconfig.xml, I'm also interested in good examples, preferably from existing open source projects or good blog posts.

Any pointers are welcome!

javanna
Pål Brattberg

1 Answer

The book Taming Text (http://www.manning.com/ingersoll/) has some references to ExtractingRequestHandler. It is about processing text using open source tools such as Solr, Tika and Lucene.

I have read up to chapter 5, and so far the book explains how to extend Solr's functionality by modifying schema.xml to create different field types and to process text at query time or index time.
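For orientation, here is a minimal sketch (not taken from the book) of how a request to ExtractingRequestHandler is usually parameterised, since the schema has to agree with these parameters. The core name "docs", the target field "text" and the prefix "attr_" are assumptions for illustration; the handler path /update/extract is the one used in Solr's example configuration and may differ in your solrconfig.xml.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, extra_params=None):
    """Build the URL for posting one rich document to Solr's ExtractingRequestHandler.

    base_url is the core URL, e.g. http://localhost:8983/solr/docs (hypothetical core).
    """
    params = {
        "literal.id": doc_id,     # set the unique key field explicitly
        "uprefix": "attr_",       # prefix Tika fields the schema does not define (assumed prefix)
        "fmap.content": "text",   # map the extracted body into a field named "text" (assumed name)
        "commit": "true",         # commit after this document; drop this for bulk imports
    }
    if extra_params:
        params.update(extra_params)
    return f"{base_url}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # The file itself would then be sent as the POST body to this URL.
    print(build_extract_url("http://localhost:8983/solr/docs", "doc1"))
```

With a setup like this, the schema would need a "text" field and a dynamic field matching attr_*, so that unknown Tika metadata is kept instead of causing errors.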

josegil