SolrWayback
GitHub repo: https://github.com/netarchivesuite/solrwayback
SolrWayback is a web application for browsing historical harvested ARC/WARC files similar to the Internet Archive Wayback Machine. SolrWayback runs on a Solr server containing ARC/WARC files indexed using the warc-indexer.
SolrWayback bundle release 4.4.2 can be downloaded here: https://github.com/netarchivesuite/solrwayback/releases/tag/4.4.2
The bundle is the recommended way to get started with SolrWayback. You download the bundle, follow the installation guide and index your own WARC files. Then you are up to speed.
## Install instructions from ReadMe file once SolrWayback bundle has been downloaded.
### 1) INITIAL SETUP
Properties:
Copy the two files `properties/solrwayback.properties` and `properties/solrwaybackweb.properties` to your home folder (or the home folder of the Tomcat user).
Optional: For screenshot previews to work you may have to edit `solrwayback.properties` and change the value of the last two properties: `chrome.command` and `screenshot.temp.imagedir`.
Chrome (Chromium) must be installed for screenshot preview images.
If a script fails to run, try changing the permissions of the file (`startup.sh` etc.). Linux: `chmod +x filename.sh`
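The setup steps above can be sketched as two small shell functions. This is only an illustration: `install_properties` and `make_scripts_executable` are hypothetical helper names and are not part of the bundle.

```shell
# install_properties SRC_DIR DEST_DIR
# Copy the two property files from SRC_DIR (the bundle's properties folder)
# to DEST_DIR (usually "$HOME", or the Tomcat user's home folder).
install_properties() {
    src_dir=$1
    dest_dir=$2
    for f in solrwayback.properties solrwaybackweb.properties; do
        cp "$src_dir/$f" "$dest_dir/$f" || return 1
    done
}

# make_scripts_executable DIR
# Add the executable bit to every .sh file directly inside DIR.
make_scripts_executable() {
    for script in "$1"/*.sh; do
        if [ -e "$script" ]; then
            chmod +x "$script" || return 1
        fi
    done
    return 0
}
```

From the bundle folder, a typical call would be `install_properties properties "$HOME"` followed by `make_scripts_executable apache-tomcat-8.5.60/bin`.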
### 2) STARTING SOLRWAYBACK
SolrWayback requires both Solr and Tomcat to be running.
#### Tomcat:
* Start tomcat: `apache-tomcat-8.5.60/bin/startup.sh`
* Stop tomcat: `apache-tomcat-8.5.60/bin/shutdown.sh`
* (On Windows, navigate to `apache-tomcat-8.5.60/bin/` and run `startup.bat` or `shutdown.bat`)
* To check that Tomcat is running, open: http://localhost:8080/solrwayback/
#### Solr:
* Start solr: `solr-7.7.3/bin/solr start`
* Stop solr: `solr-7.7.3/bin/solr stop -all`
* (On Windows, navigate to `solr-7.7.3/bin/` and run `solr.cmd start` or `solr.cmd stop -all`)
* To check that Solr is running, open: http://localhost:8983/solr/#/netarchivebuilder
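The same checks can be done from a terminal instead of a browser. The sketch below assumes `curl` is installed; `is_up` and `report` are hypothetical helper names. The `#/netarchivebuilder` part of the Solr URL is only meaningful in a browser, so the sketch is meant to be pointed at `http://localhost:8983/solr/` instead.

```shell
# is_up URL
# Succeed if URL answers with a non-error HTTP status within 5 seconds.
is_up() {
    curl --silent --fail --output /dev/null --max-time 5 "$1"
}

# report NAME URL
# Print one status line for the service called NAME.
report() {
    if is_up "$2"; then
        echo "$1: running"
    else
        echo "$1: not reachable"
    fi
}
```

For example, `report Tomcat http://localhost:8080/solrwayback/` and `report Solr http://localhost:8983/solr/` should each print a `running` line once both services have started.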
### 3) INDEXING
SolrWayback uses a Solr index of WARC files to support freetext search and more complex queries.
If you do not have existing WARC files, see steps below on harvesting with wget.
The script `warc-indexer.sh` in the `indexing` folder supports multiprocessing and keeps track of already
indexed files, so the collection can be extended by adding more WARCs and running the script again.
Call `indexing/warc-indexer.sh -h` for usage and how to adjust the number of processes to use for indexing.
Example usage that indexes all WARC files in the `warcs1` folder:
```
THREADS=2 ./warc-indexer.sh warcs1/*
```
This will start indexing files from the `warcs1` folder using 2 threads. Assigning more threads than there are CPU cores available will result in slower indexing. Each indexing job requires 1 GB of RAM, so memory can also be a limiting factor.
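The two limits (CPU cores and 1 GB of RAM per job) can be combined into a thread count. The sketch below is one way to do that. `pick_threads` and `available_mem_gb` are hypothetical helpers, and `available_mem_gb` assumes a Linux system that exposes `MemAvailable` in `/proc/meminfo`.

```shell
# pick_threads CORES MEM_GB
# Print the smaller of CORES and MEM_GB (1 GB per indexing job), but at least 1.
pick_threads() {
    cores=$1
    mem_gb=$2
    threads=$cores
    if [ "$mem_gb" -lt "$threads" ]; then
        threads=$mem_gb
    fi
    if [ "$threads" -lt 1 ]; then
        threads=1
    fi
    echo "$threads"
}

# available_mem_gb
# Print the available memory in whole GB (Linux only, reads /proc/meminfo).
available_mem_gb() {
    awk '/^MemAvailable:/ {print int($2 / 1048576)}' /proc/meminfo
}
```

On Linux this could be used as `THREADS=$(pick_threads "$(nproc)" "$(available_mem_gb)") ./warc-indexer.sh warcs1/*`.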
You can also populate the `collection` and `collection_id` fields in Solr with custom values:
```
THREADS=4 INDEXER_CUSTOM="--collection_id collection1 --collection corona2021" ./warc-indexer.sh warcs1/*
```
You can then enable faceting on these fields in `solrwaybackweb.properties`.
The script keeps track of processed files by checking if a log from a previous analysis is available. The logs are stored
in the `status`-folder (this can be changed using the `STATUS_ROOT` variable). To re-index a WARC file, delete the
corresponding log file.
The script `warc-indexer.sh` is not available for Windows. For Windows, only a more primitive script is provided; it also works on Linux and macOS.
1. Copy ARC/WARC files into folder: `indexing/warcs1`
2. Start indexing: call `indexing/batch_warcs1_folder.sh` (or `batch_warcs1_folder.bat` on Windows)
Indexing can take up to 20 minutes per 1 GB of WARC files. After indexing, the WARC files must stay in the same folder, since SolrWayback reads them during playback etc.
Whitespace characters in WARC file names can cause page previews and playback to fail on some systems.
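A quick way to spot problematic file names before indexing is sketched below. `list_whitespace_names` is a hypothetical helper, and it assumes a `find` that supports `-maxdepth` (GNU and BSD versions do).

```shell
# list_whitespace_names DIR
# Print every regular file directly inside DIR whose name contains whitespace.
list_whitespace_names() {
    find "$1" -maxdepth 1 -type f -name '*[[:space:]]*'
}
```

Running `list_whitespace_names indexing/warcs1` prints nothing when all file names are safe. Any file it lists should be renamed before it is indexed, because the files must keep their location after indexing.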
It can take up to 5 minutes before indexed files become visible in search. To commit them immediately, visit this URL after the index job has finished: http://localhost:8983/solr/netarchivebuilder/update?commit=true
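The commit URL can also be requested from a script once indexing is done. The sketch below assumes `curl` is installed and uses the host, port and core name from the URL above as defaults. `solr_commit_url` and `commit_now` are hypothetical helper names.

```shell
# solr_commit_url [BASE_URL] [CORE]
# Print the Solr commit URL; defaults match the bundle's local setup.
solr_commit_url() {
    base=${1:-http://localhost:8983/solr}
    core=${2:-netarchivebuilder}
    echo "$base/$core/update?commit=true"
}

# commit_now [BASE_URL] [CORE]
# Ask Solr to commit pending documents; fails if Solr returns an HTTP error.
commit_now() {
    curl --silent --fail "$(solr_commit_url "$@")"
}
```

Calling `commit_now` with no arguments requests the same URL as the link above.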
A similar script, `batch_warcs2_folder.sh`, shows how to add new WARC files to the collection without re-indexing the old ones.
For more information about the warc-indexer see: https://github.com/ukwa/webarchive-discovery/wiki/Quick-Start