Is there a best practice schema.xml for SOLR when importing rich documents?

Question

I'm working with SOLR on a project where we import a large number (~40k items) of rich documents, mainly MS Word, PowerPoint, Excel and PDF files.

Is there a best practice schema.xml and/or solrconfig.xml to use in SOLR when using the ExtractingRequestHandler?

I have been tweaking the default schema to try to get facets working on document modification dates, but even without that, I figure there could very well be a good example of how these files should look when the default output from Tika is enough.

If there is no such thing as a best-practice schema.xml and/or solrconfig.xml I'm also interested in good examples, preferably from existing open source projects or even good blog posts.

Any pointers are welcome!

score 0 · Answer 1 · answered Dec 09 '11 at 14:04


The book Taming Text (http://www.manning.com/ingersoll/) has some references to ExtractingRequestHandler. It is about processing text with open source tools such as Solr, Tika and Lucene.

I have read up to chapter 5, and so far the book explains how to extend Solr's functionality by modifying the schema.xml file to create different field types and to configure processing at query time or at index time.
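Not from the book, but to make the index-time side concrete: besides the field types in schema.xml, the ExtractingRequestHandler takes request parameters that decide which schema fields the Tika output ends up in. Below is a minimal sketch that only builds such a request URL. The base URL, the `/update/extract` mount point, the target field `text` and the `attr_` prefix are assumptions borrowed from the stock example configuration, and `build_extract_url` is a hypothetical helper, so adjust all of them to your own schema.xml and solrconfig.xml.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, field_map=None,
                      unknown_prefix="attr_", commit=True):
    """Build a URL for Solr's ExtractingRequestHandler (hypothetical helper).

    literal.<field>  sets a field value directly (here: the unique key)
    fmap.<source>    renames a Tika-produced field to a schema field
    uprefix          prefixes fields that are not defined in schema.xml
    """
    params = {"literal.id": doc_id, "uprefix": unknown_prefix}
    for tika_name, solr_field in (field_map or {}).items():
        params["fmap." + tika_name] = solr_field
    if commit:
        params["commit"] = "true"
    # Assumes the handler is registered at /update/extract in solrconfig.xml.
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr instance; map Tika's "content" to a "text" field.
    print(build_extract_url("http://localhost:8983/solr", "doc1",
                            {"content": "text"}))
```

The document itself would then be sent as the body of a POST to that URL. With `uprefix`, metadata fields that Tika emits but the schema does not define can be caught by a matching dynamic field (for example `attr_*`) instead of causing an error, which makes it easier to see what Tika actually produces before settling on a schema.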


josegil


Ok, if you find something concerning best practices or so, be sure to update your answer. Thanks – Pål Brattberg Dec 10 '11 at 14:57
