Solr Cell & schemaless indexing with Apache Tika

Key Solr Cell Concepts

When using the Solr Cell framework, it is helpful to keep the following in mind:

Tika will automatically attempt to determine the input document type (e.g., Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the stream.type parameter. See http://tika.apache.org/1.24.1/formats.html for the file types supported.
Briefly, Tika works internally by synthesizing an XHTML document from the core content of the parsed document and passing it to a configured SAX ContentHandler provided by Solr Cell. Solr responds to Tika’s SAX events to create one or more text fields from the content. Apart from the XHTML, Tika also exposes document metadata.
Tika produces metadata such as Title, Subject, and Author according to specifications such as Dublin Core. The metadata available depends heavily on the file type and on what each file contains. Some of the general metadata created is described in the section Metadata Created by Tika below. Solr Cell also supplies some metadata of its own.
Solr Cell concatenates text from the internal XHTML into a content field. You can configure which elements should be included/ignored, and which should map to another field.
Solr Cell maps each piece of metadata onto a field. By default each item maps to a field of the same name, but several parameters control the mapping.
When Solr Cell finishes creating the internal SolrInputDocument, the rest of the Lucene/Solr indexing stack takes over. The next step after any update handler is the Update Request Processor chain.
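
The flow described above (XHTML text concatenated into a content field, selected elements copied to their own fields, metadata mapped onto fields) can be illustrated with a small Python sketch. This is not Solr Cell's implementation: the class ContentCollector, the function extract, and the capture and fmap arguments are hypothetical names chosen for illustration, and the XHTML string stands in for what Tika would produce.

```python
import xml.sax


class ContentCollector(xml.sax.ContentHandler):
    """Toy SAX handler: joins all text into one catch-all field and copies
    the text of selected elements into fields named after those elements."""

    def __init__(self, capture=()):
        super().__init__()
        self.capture = set(capture)
        self.content = []    # text chunks for the catch-all field
        self.captured = {}   # element name -> list of captured values
        self._buffers = []   # one (name, chunks) pair per open captured element

    def startElement(self, name, attrs):
        if name in self.capture:
            self._buffers.append((name, []))

    def characters(self, text):
        # The parser may deliver text in several chunks, so only collect here.
        self.content.append(text)
        for _, chunks in self._buffers:
            chunks.append(text)

    def endElement(self, name):
        # Insert a separator so words of adjacent elements do not run together.
        self.content.append(" ")
        for _, chunks in self._buffers:
            chunks.append(" ")
        if self._buffers and self._buffers[-1][0] == name:
            _, chunks = self._buffers.pop()
            value = " ".join("".join(chunks).split())
            self.captured.setdefault(name, []).append(value)


def extract(xhtml, metadata, capture=(), fmap=None):
    """Build a dict standing in for the document that would be indexed."""
    fmap = fmap or {}
    handler = ContentCollector(capture)
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    fields = {"content": " ".join("".join(handler.content).split())}
    fields.update(handler.captured)
    fields.update(metadata)  # by default, metadata keeps its own name
    # Rename any field that has an explicit mapping.
    return {fmap.get(name, name): value for name, value in fields.items()}


xhtml = "<html><body><h1>Report</h1><p>First paragraph.</p><p>Second one.</p></body></html>"
doc = extract(xhtml, {"Author": "Jane"}, capture=["p"],
              fmap={"content": "_text_", "Author": "author_s"})
print(doc)
```

In this sketch the paragraph text ends up both in the catch-all field and in the separate p field, and the field names _text_ and author_s are assumed targets, not names taken from any particular schema.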

Solr Cell is a contrib, which means it’s not automatically included with Solr but must be configured. The example configsets have Solr Cell configured, but if you are not using those, you will want to pay attention to the section Configuring the ExtractingRequestHandler in solrconfig.xml below.
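
As a rough illustration of how the handler configuration relates to requests, the following Python sketch looks up the path under which an ExtractingRequestHandler is registered and builds a request URL for it. The XML fragment is an assumed, minimal example of what such a declaration can look like, not a complete solrconfig.xml; the host, the techproducts collection, the document ID, and the function names are likewise assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Short class name as commonly written in solrconfig.xml; a config that uses
# the fully qualified class name would not be matched by this simple check.
EXTRACT_CLASS = "solr.extraction.ExtractingRequestHandler"

# Illustrative fragment only; a real solrconfig.xml contains much more.
SAMPLE_CONFIG = """
<config>
  <requestHandler name="/update/extract" startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>
</config>
"""


def find_extract_handler(config_xml):
    """Return the path of the first extracting handler, or None if absent."""
    root = ET.fromstring(config_xml)
    for handler in root.iter("requestHandler"):
        if handler.get("class") == EXTRACT_CLASS:
            return handler.get("name")
    return None


def build_extract_url(solr_base, collection, handler_path, doc_id,
                      mime_type=None, commit=True):
    """Assemble the URL a file would be posted to for extraction."""
    params = [("literal.id", doc_id)]
    if mime_type:
        # Optional explicit MIME type, instead of relying on auto-detection.
        params.append(("stream.type", mime_type))
    if commit:
        params.append(("commit", "true"))
    return f"{solr_base.rstrip('/')}/{collection}{handler_path}?{urlencode(params)}"


path = find_extract_handler(SAMPLE_CONFIG)
url = build_extract_url("http://localhost:8983/solr", "techproducts", path,
                        "doc1", mime_type="application/pdf")
print(url)
```

The file itself would then be sent to that URL as the request body or as a multipart upload, for example with curl or an HTTP client library.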

For more information see:

https://solr.apache.org/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html

digitale-akademie.adw-goe.de

ADW Göttingen
Theaterstr. 7
