Solr Cell & schemaless indexing with Apache Tika

Key Solr Cell Concepts

When using the Solr Cell framework, it is helpful to keep the following in mind:

  • Tika will automatically attempt to determine the input document type (e.g., Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the stream.type parameter. See http://tika.apache.org/1.24.1/formats.html for the file types supported.
  • Briefly, Tika internally works by synthesizing an XHTML document from the core content of the parsed document which is passed to a configured SAX ContentHandler provided by Solr Cell. Solr responds to Tika’s SAX events to create one or more text fields from the content. Tika exposes document metadata as well (apart from the XHTML).
  • Tika produces metadata such as Title, Subject, and Author according to specifications such as Dublin Core. The metadata available depends heavily on the file type and on what each file actually contains. Some of the general metadata created is described in the section Metadata Created by Tika below. Solr Cell also supplies some metadata of its own.
  • Solr Cell concatenates text from the internal XHTML into a content field. You can configure which elements should be included/ignored, and which should map to another field.
  • Solr Cell maps each piece of metadata onto a field. By default it maps to the same name but several parameters control how this is done.
  • When Solr Cell finishes creating the internal SolrInputDocument, the rest of the Lucene/Solr indexing stack takes over. The next step after any update handler is the Update Request Processor chain.
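
To make the request parameters mentioned above more concrete, here is a minimal sketch that builds a request URL for the ExtractingRequestHandler. It assumes the handler is mounted at `/update/extract`, that Solr listens on `localhost:8983`, and that a collection called `mycollection` exists; the collection name, the helper name `extract_url`, and the example field `text_all` are hypothetical. `literal.id`, `fmap.content`, and `stream.type` are Solr Cell request parameters. Values are not URL-encoded, so the sketch only suits simple ids and field names.

```shell
# Assumed defaults; adjust to your own installation.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
COLLECTION="${COLLECTION:-mycollection}"   # hypothetical collection name

# extract_url ID [CONTENT_FIELD] [MIME_TYPE]
#   ID            -> literal.id   (unique key of the new document)
#   CONTENT_FIELD -> fmap.content (map the extracted text to another field)
#   MIME_TYPE     -> stream.type  (tell Tika the type instead of auto-detecting)
extract_url() {
  url="$SOLR_URL/$COLLECTION/update/extract?literal.id=$1&commit=true"
  if [ -n "${2:-}" ]; then url="$url&fmap.content=$2"; fi
  if [ -n "${3:-}" ]; then url="$url&stream.type=$3"; fi
  printf '%s\n' "$url"
}

# Usage (needs a running Solr and a local file report.pdf):
#   curl "$(extract_url doc1 text_all application/pdf)" -F "myfile=@report.pdf"
```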

Solr Cell is a contrib, which means it’s not automatically included with Solr but must be configured. The example configsets have Solr Cell configured, but if you are not using those, you will want to pay attention to the section Configuring the ExtractingRequestHandler in solrconfig.xml below.

GitHub repo: https://github.com/netarchivesuite/solrwayback

About SolrWayback

SolrWayback is a web application for browsing historical harvested ARC/WARC files similar to the Internet Archive Wayback Machine. SolrWayback runs on a Solr server containing ARC/WARC files indexed using the warc-indexer.

SolrWayback bundle release 4.4.2 can be downloaded here: https://github.com/netarchivesuite/solrwayback/releases/tag/4.4.2

The bundle is the recommended way to get started with SolrWayback. You download the bundle, follow the installation guide and index your own WARC files. Then you are up to speed.


## Install instructions from the README file, once the SolrWayback bundle has been downloaded

Copy the two files `properties/solrwayback.properties` and `properties/solrwaybackweb.properties` to your home folder (or the home folder of the Tomcat user).

Optional: For screenshot previews to work, you may have to edit `solrwayback.properties` and change the values of the last two properties: `chrome.command` and `screenshot.temp.imagedir`.
Chrome (Chromium) must be installed for screenshot preview images.

If there are errors when running a script, try changing the permissions of the file (`startup.sh` etc.). Linux: `chmod +x filename.sh`

SolrWayback requires both Solr and Tomcat to be running.

#### Tomcat:

* Start Tomcat: `apache-tomcat-8.5.60/bin/startup.sh`
* Stop Tomcat: `apache-tomcat-8.5.60/bin/shutdown.sh`
* (On Windows, navigate to `apache-tomcat-8.5.60/bin/` and type `startup.bat` or `shutdown.bat`)
* To check that Tomcat is running, open: http://localhost:8080/solrwayback/

#### Solr:
* Start Solr: `solr-7.7.3/bin/solr start`
* Stop Solr: `solr-7.7.3/bin/solr stop -all`
* (On Windows, navigate to `solr-7.7.3/bin/` and type `solr.cmd start` or `solr.cmd stop -all`)
* To check that Solr is running, open: http://localhost:8983/solr/#/netarchivebuilder

SolrWayback uses a Solr index of WARC files to support freetext search and more complex queries.
If you do not have existing WARC files, see steps below on harvesting with wget.

The script `warc-indexer.sh` in the `indexing` folder supports multiprocessing and keeps track of already
indexed files, so the collection can be extended by adding more WARCs and running the script again.

Call `indexing/warc-indexer.sh -h` for usage and how to adjust the number of processes to use for indexing.
Example usage that indexes all WARC files in the `warcs1` folder:

`THREADS=2 ./warc-indexer.sh warcs1/*`

This will start indexing files from the `warcs1` folder using 2 threads. Assigning more threads than there are CPU cores available will result in slower indexing. Each indexing job requires 1 GB of RAM, so memory can also be a limiting factor.
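
The two limits above (CPU cores, and roughly 1 GB of RAM per indexing job) can be combined into a simple rule of thumb. The helper below is only a sketch under that assumption; `pick_threads` is a hypothetical name and is not part of the bundle. It takes the number of cores and the available memory in whole GB and prints the smaller of the two, with a floor of 1.

```shell
# pick_threads CORES FREE_GB
# Prints min(CORES, FREE_GB), but never less than 1.
pick_threads() {
  cores=$1
  free_gb=$2
  threads=$cores
  if [ "$free_gb" -lt "$threads" ]; then
    threads=$free_gb
  fi
  if [ "$threads" -lt 1 ]; then
    threads=1
  fi
  printf '%s\n' "$threads"
}

# Example on Linux (assumes nproc and /proc/meminfo are available):
#   cores=$(nproc)
#   free_gb=$(awk '/MemAvailable/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
#   THREADS=$(pick_threads "$cores" "$free_gb") ./warc-indexer.sh warcs1/*
```

The result is a conservative starting point, not a tuned value; Solr and Tomcat need memory too.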

You can also populate the collection and collectionid fields in Solr with custom values:

`THREADS=4 INDEXER_CUSTOM="--collection_id collection1 --collection corona2021" ./warc-indexer.sh warcs1/*`

You can then enable faceting on these fields in `solrwaybackweb.properties`.


The script keeps track of processed files by checking if a log from a previous analysis is available. The logs are stored
in the `status`-folder (this can be changed using the `STATUS_ROOT` variable). To re-index a WARC file, delete the
corresponding log file.

The script `warc-indexer.sh` is not available for Windows. For Windows, only a more basic script is provided; it also works on Linux and macOS.
1. Copy ARC/WARC files into the folder `indexing/warcs1`
2. Start indexing: call `indexing/batch_warcs1_folder.sh` (or `batch_warcs1_folder.bat` on Windows)

Indexing can take up to 20 minutes for a 1 GB WARC file. After indexing, the WARC files must stay in the same folder, since SolrWayback reads them during playback.

Whitespace characters in WARC file names can cause page previews and playback to fail on some systems.
There can be a delay of up to 5 minutes before the indexed files are visible in search. To commit them instantly, visit this URL after the index job has finished: http://localhost:8983/solr/netarchivebuilder/update?commit=true
A similar script, `batch_warcs2_folder.sh`, shows how to add new WARC files to the collection without indexing the old ones again.
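
Since whitespace in WARC file names can break page previews and playback, it may be worth scanning a folder before indexing it. The function below is a small sketch for that check; `list_whitespace_names` is a hypothetical helper, not something shipped with the bundle. It relies on `find -maxdepth`, which GNU and BSD `find` support but POSIX does not require.

```shell
# list_whitespace_names DIR
# Prints every regular file directly inside DIR whose file name
# contains a whitespace character (space, tab, ...).
list_whitespace_names() {
  find "$1" -maxdepth 1 -type f -name '*[[:space:]]*'
}

# Usage: list_whitespace_names indexing/warcs1
```

If the function prints nothing, no file names in the folder need attention. Renaming is left as a manual step, because a file that has already been indexed must keep its name and location for playback.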

For more information about the warc-indexer see: https://github.com/ukwa/webarchive-discovery/wiki/Quick-Start