Key Solr Cell Concepts
When using the Solr Cell framework, it is helpful to keep the following in mind:
- Tika automatically attempts to determine the input document type (e.g., Word, PDF, HTML) and extracts the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the stream.type parameter. See http://tika.apache.org/1.24.1/formats.html for the supported file types.
- Briefly, Tika works internally by synthesizing an XHTML document from the core content of the parsed document and passing it to a configured SAX ContentHandler provided by Solr Cell. Solr responds to Tika’s SAX events to create one or more text fields from the content. Tika also exposes document metadata, separately from the XHTML.
- Tika produces metadata such as Title, Subject, and Author according to specifications such as Dublin Core. The metadata available depends heavily on the file type and on what each file contains. Some of the general metadata created is described in the section Metadata Created by Tika below. Solr Cell also supplies some metadata of its own.
- Solr Cell concatenates text from the internal XHTML into a content field. You can configure which elements should be included or ignored, and which should map to another field.
- Solr Cell maps each piece of metadata onto a field. By default it maps to a field of the same name, but several parameters control how this is done.
- When Solr Cell finishes creating the internal SolrInputDocument, the rest of the Lucene/Solr indexing stack takes over. The next step after any update handler is the Update Request Processor chain.
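To make the SAX-based flow described above more concrete, here is a minimal sketch in Python. It is not Solr Cell's or Tika's actual code (both are Java); it only illustrates the general idea of a SAX content handler that receives events for an XHTML document, skips some elements, and concatenates the remaining text into one content value. The class name ContentFieldHandler, the helper extract_content, and the choice of ignored elements are all hypothetical.

```python
import xml.sax
from io import BytesIO


class ContentFieldHandler(xml.sax.ContentHandler):
    """Toy SAX handler that gathers element text into one 'content' value.

    Illustration only: the names and the ignore list are assumptions,
    not Solr Cell's real implementation.
    """

    def __init__(self, ignored=("script", "style")):
        super().__init__()
        self.ignored = set(ignored)  # elements whose text is skipped
        self.skip_depth = 0          # > 0 while inside an ignored element
        self.chunks = []             # text pieces collected so far

    def startElement(self, name, attrs):
        # Track nesting so children of an ignored element are skipped too.
        if name in self.ignored or self.skip_depth:
            self.skip_depth += 1

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1

    def characters(self, text):
        # Keep non-blank text that is outside any ignored element.
        if not self.skip_depth and text.strip():
            self.chunks.append(text.strip())

    def content(self):
        return " ".join(self.chunks)


def extract_content(xhtml: bytes) -> str:
    """Parse an XHTML byte string and return the concatenated text."""
    handler = ContentFieldHandler()
    xml.sax.parse(BytesIO(xhtml), handler)
    return handler.content()


sample = b"<html><body><p>Hello</p><script>x()</script><p>world</p></body></html>"
print(extract_content(sample))
```

The handler never sees the original Word or PDF file, only the stream of XHTML events, which is why the same field-building logic can serve every format the parser understands.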
Solr Cell is a contrib, which means it’s not automatically included with Solr but must be configured. The example configsets have Solr Cell configured, but if you are not using those, you will want to pay attention to the section Configuring the ExtractingRequestHandler in solrconfig.xml below.
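Once the handler is configured, options such as stream.type and the field-mapping controls are passed as ordinary request parameters. The sketch below only builds such a request URL; it does not contact a server. It assumes the handler is registered at /update/extract (as in the example configsets) and uses the literal.<field>, fmap.<source> and uprefix parameters as they are understood from the Solr Cell documentation. The function name, core name, field names, and prefix are hypothetical, so check them against your own solrconfig.xml and schema.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, mime_type=None,
                      field_map=None, unknown_prefix=None):
    """Build a URL for an extract request (hypothetical helper).

    base_url       -- core or collection URL; the handler path is assumed
    doc_id         -- value sent as literal.id
    mime_type      -- optional explicit MIME type, sent as stream.type
    field_map      -- optional {source_field: target_field}, sent as fmap.*
    unknown_prefix -- optional prefix for unmapped fields, sent as uprefix
    """
    params = [("literal.id", doc_id), ("commit", "true")]
    if mime_type:
        params.append(("stream.type", mime_type))
    for source, target in (field_map or {}).items():
        params.append((f"fmap.{source}", target))
    if unknown_prefix:
        params.append(("uprefix", unknown_prefix))
    # Assumption: the handler is mapped to /update/extract.
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"


url = build_extract_url(
    "http://localhost:8983/solr/techproducts",  # hypothetical core
    "doc1",
    mime_type="application/pdf",
    field_map={"content": "text"},
    unknown_prefix="ignored_",
)
print(url)
```

The document itself would then be sent as the body of a POST to this URL, for example with curl or an HTTP client library.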
For more information see:
https://solr.apache.org/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html