
SolrWayback

GitHub repo: https://github.com/netarchivesuite/solrwayback

About SolrWayback

SolrWayback is a web application for browsing historical harvested ARC/WARC files similar to the Internet Archive Wayback Machine. SolrWayback runs on a Solr server containing ARC/WARC files indexed using the warc-indexer.

SolrWayback bundle release 4.4.2 can be downloaded here: https://github.com/netarchivesuite/solrwayback/releases/tag/4.4.2

The bundle is the recommended way to get started with SolrWayback. You download the bundle, follow the installation guide and index your own WARC files. Then you are up to speed.

 

## Install instructions from the README file, once the SolrWayback bundle has been downloaded

### 1) INITIAL SETUP
Properties:
Copy the two files `properties/solrwayback.properties` and `properties/solrwaybackweb.properties` to your HOME folder (or the home folder of the Tomcat user).

Optional: for screenshot previews to work, you may have to edit `solrwayback.properties` and change the values of the last two properties: `chrome.command` and `screenshot.temp.imagedir`.
Chrome (or Chromium) must be installed for screenshot preview images.

If there are errors when running a script, try changing the permissions of the file (`startup.sh` etc.). Linux: `chmod +x filename.sh`

### 2) STARTING SOLRWAYBACK
SolrWayback requires both Solr and Tomcat to be running.

#### Tomcat:

* Start tomcat: `apache-tomcat-8.5.60/bin/startup.sh`
* Stop tomcat: `apache-tomcat-8.5.60/bin/shutdown.sh`
* (On Windows, navigate to `apache-tomcat-8.5.60/bin/` and type `startup.bat` or `shutdown.bat`)
* To check that Tomcat is running, open: http://localhost:8080/solrwayback/

#### Solr:
* Start solr: `solr-7.7.3/bin/solr start`
* Stop solr: `solr-7.7.3/bin/solr stop -all`
* (On Windows, navigate to `solr-7.7.3/bin/` and type `solr.cmd start` or `solr.cmd stop -all`)
* To check that Solr is running, open: http://localhost:8983/solr/#/netarchivebuilder

### 3) INDEXING
SolrWayback uses a Solr index of WARC files to support freetext search and more complex queries.
If you do not have existing WARC files, see steps below on harvesting with wget.

The script `warc-indexer.sh` in the `indexing`-folder allows for multi processing and keeps track of already
indexed files, so the collection can be extended by adding more WARCs and running the script again.

Call `indexing/warc-indexer.sh -h` for usage and how to adjust the number of processes to use for indexing.
Example usage that indexes all WARC files in the `warcs1` folder:
```
THREADS=2 ./warc-indexer.sh warcs1/*
```

This will start indexing files from the `warcs1` folder using 2 threads. Assigning more threads than there are CPU cores available will slow indexing down. Each indexing job requires 1 GB of RAM, so memory can also be a limiting factor.

You can also populate the collection and collectionid fields in Solr with custom values:
```
THREADS=4 INDEXER_CUSTOM="--collection_id collection1 --collection corona2021" ./warc-indexer.sh warcs1/*
```

You can then enable faceting on these fields in `solrwaybackweb.properties`.

 

The script keeps track of processed files by checking if a log from a previous analysis is available. The logs are stored
in the `status`-folder (this can be changed using the `STATUS_ROOT` variable). To re-index a WARC file, delete the
corresponding log file.
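
The tracking described above can be pictured with a small shell sketch. This is not the code of `warc-indexer.sh`: the log naming (`<WARC file name>.log` inside `STATUS_ROOT`) and the helper names `needs_indexing` and `pending_warcs` are assumptions made purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether a WARC still needs indexing by
# looking for a log file left behind by an earlier run.
# Assumption: one log per WARC, named "<basename>.log" under STATUS_ROOT.
STATUS_ROOT="${STATUS_ROOT:-status}"

needs_indexing() {
  local warc="$1"
  local log
  log="$STATUS_ROOT/$(basename "$warc").log"
  # No log means the file has not been processed yet.
  [ ! -e "$log" ]
}

# Print those of the given files that have no log yet.
pending_warcs() {
  local warc
  for warc in "$@"; do
    if needs_indexing "$warc"; then
      printf '%s\n' "$warc"
    fi
  done
}
```

Under these assumptions, `pending_warcs warcs1/*` would list only the files that a new run has to process, and deleting a log makes its WARC show up again.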

The script `warc-indexer.sh` is not available for Windows. For Windows, only a more basic script is provided, which also works on Linux/macOS.
1. Copy ARC/WARC files into the folder `indexing/warcs1`
2. Start indexing: call `indexing/batch_warcs1_folder.sh` (or `batch_warcs1_folder.bat` on Windows)

Indexing can take up to 20 minutes per 1 GB of WARC files. After indexing, the WARC files must stay in the same folder, since SolrWayback reads them during playback etc.

Whitespace characters in WARC file names can prevent page previews and playback from working on some systems.
There can be a delay of up to 5 minutes before the indexed files are visible in search. To commit them instantly, visit this URL after the index job has finished: http://localhost:8983/solr/netarchivebuilder/update?commit=true
A similar script, `batch_warcs2_folder.sh`, shows how to add new WARC files to the collection without indexing the old ones again.
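
The commit URL can also be called from a script instead of a browser. The following is a minimal sketch that assumes `curl` is installed; the function names `commit_url` and `solr_commit` are made up for this example, and the defaults simply repeat the host, port and collection name used above.

```shell
#!/usr/bin/env bash
# Build the commit URL for a Solr collection.
# Defaults match the host, port and collection used in this guide.
commit_url() {
  local base="${1:-http://localhost:8983/solr}"
  local collection="${2:-netarchivebuilder}"
  printf '%s/%s/update?commit=true\n' "${base%/}" "$collection"
}

# Ask Solr to commit; curl exits non-zero on HTTP errors.
solr_commit() {
  curl --fail --silent --show-error "$(commit_url "$@")"
}

# Usage (requires a running Solr):
#   solr_commit
```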

For more information about the warc-indexer see: https://github.com/ukwa/webarchive-discovery/wiki/Quick-Start

Specification for an OAI Static Repository and an OAI Static Repository Gateway

 

http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm

 

A Static Repository, introduced here, provides a simple approach for exposing relatively static and small collections of metadata records through the OAI-PMH. The Static Repository approach is targeted at organizations that:

  • Have metadata collections ranging in size between 1 and 5000 records;
  • Can make static content available through a network-accessible Web server;
  • Need a technically simpler implementation strategy compared to acting as an OAI-PMH Repository, which requires processing OAI-PMH requests.

A Static Repository is an XML file that is made accessible at a persistent HTTP URL. The XML file contains metadata records and repository information.

A Static Repository becomes accessible via OAI-PMH through the intermediation of one Static Repository Gateway. The restriction that only one Static Repository Gateway acts as an intermediary for each Static Repository reduces potential problems with large-scale duplication of metadata records among OAI-PMH repositories. A Static Repository Gateway uses the metadata records and repository information, provided via XML in the Static Repository, to respond to the six OAI-PMH requests for access to that information. Because a Static Repository Gateway maps a unique Static Repository base URL to each such Static Repository, harvesters can access a Static Repository in exactly the same manner as they access any other OAI-PMH Repository.
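
From a harvester's point of view, this means ordinary OAI-PMH request URLs against the base URL that the gateway assigns. The sketch below shows how such URLs could be put together. The host names are placeholders, the function names are invented, and the mapping rule (gateway URL, a slash, then the Static Repository URL without its `http://` prefix) is an assumption about the convention; check the specification before relying on it.

```shell
#!/usr/bin/env bash
# The six OAI-PMH request types (verbs).
OAI_VERBS="Identify ListMetadataFormats ListSets ListIdentifiers ListRecords GetRecord"

# Derive the base URL a harvester would use for a Static Repository.
# Assumed convention: <gateway URL>/<repository URL without "http://">.
static_repo_base_url() {
  local gateway="$1"
  local repo="$2"
  printf '%s/%s\n' "${gateway%/}" "${repo#http://}"
}

# Build a request URL: base URL, verb, then optional extra arguments
# given as already-encoded key=value pairs.
oai_request_url() {
  local base="$1"
  local verb="$2"
  shift 2
  local url="${base}?verb=${verb}"
  local arg
  for arg in "$@"; do
    url="${url}&${arg}"
  done
  printf '%s\n' "$url"
}

# Usage:
#   base=$(static_repo_base_url http://gateway.example.org/oai http://repo.example.org/ma/mini.xml)
#   oai_request_url "$base" ListRecords metadataPrefix=oai_dc
```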

The relationship between Static Repositories, a Static Repository Gateway, and an OAI-PMH harvester is illustrated in the figure below. The Static Repository and the Static Repository Gateway are described in the remainder of this document. Implementers whose sole interest is the creation of a Static Repository may skip Section 4 that describes the Static Repository Gateway.

SphinxSE MySQL Full Text Search

SphinxSE Full Text Search engine:

http://sphinxsearch.com/docs/current.html#about

Sphinx is a full-text search engine, publicly distributed under GPL version 2. Commercial licensing (eg. for embedded use) is available upon request.

Technically, Sphinx is a standalone software package that provides fast and relevant full-text search functionality to client applications. It was specially designed to integrate well with SQL databases storing the data, and to be easily accessed by scripting languages. However, Sphinx does not depend on or require any specific database to function.

Applications can access the Sphinx search daemon (searchd) using any of three access methods: a) via Sphinx's own implementation of the MySQL network protocol (using a small SQL subset called SphinxQL; this is the recommended way), b) via the native search API (SphinxAPI), or c) via a MySQL server with a pluggable storage engine (SphinxSE).

Official native SphinxAPI implementations for PHP, Perl, Python, Ruby and Java are included within the distribution package. API is very lightweight so porting it to a new language is known to take a few hours or days. Third party API ports and plugins exist for Perl, C#, Haskell, Ruby-on-Rails, and possibly other languages and frameworks.

Starting from version 1.10-beta, Sphinx supports two different indexing backends: the "disk" index backend and the "realtime" (RT) index backend. Disk indexes support online full-text index rebuilds, but online updates can only be done on non-text (attribute) data. RT indexes additionally allow for online full-text index updates. Previous versions only supported disk indexes.

Data can be loaded into disk indexes using a so-called data source. Built-in sources can fetch data directly from MySQL, PostgreSQL, MSSQL, ODBC compliant database (Oracle, etc) or a pipe in TSV or a custom XML format. Adding new data sources drivers (eg. to natively support other DBMSes) is designed to be as easy as possible. RT indexes, as of 1.10-beta, can only be populated using SphinxQL.

http://www.sphinxsearch.com/wiki/doku.php?id=tutorials

2.3. Installing Sphinx packages on Debian and Ubuntu

There are two ways of getting Sphinx for Ubuntu: regular deb packages and the Launchpad PPA repository.

Deb packages:

  1. Sphinx requires a few libraries to be installed on Debian/Ubuntu. Use apt-get to download and install these dependencies:
     $ sudo apt-get install mysql-client unixodbc libpq5
  2. Now you can install Sphinx:
     $ sudo dpkg -i sphinxsearch_2.2.11-dev-0ubuntu12~trusty_amd64.deb

PPA repository (Ubuntu only).

Installing Sphinx is much easier from the Sphinxsearch PPA repository, because you get all dependencies and can also update Sphinx to the latest version with the same command.

  1. First, add the Sphinxsearch repository and update the list of packages:
     $ sudo add-apt-repository ppa:builds/sphinxsearch-rel22
     $ sudo apt-get update
  2. Install/update the sphinxsearch package:
     $ sudo apt-get install sphinxsearch

The Sphinx searchd daemon can be started and stopped using the service command:

$ sudo service sphinxsearch start

For a complete tutorial, see: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-sphinx-on-ubuntu-16-04

Prerequisites

Before you begin this guide, you will need:

  • One Ubuntu 16.04 server.
  • A sudo non-root user, which you can set up by following this tutorial.
  • MySQL installed on your server, which you can set up by following the step 2 of this tutorial.

Step 1 — Installing Sphinx

Installing Sphinx on Ubuntu is easy because it’s in the native package repository. Install it using apt-get.

sudo apt-get install sphinxsearch

Now you have successfully installed Sphinx on your server. Before starting the Sphinx daemon, let’s configure it.

Sphinx’s configuration should be in a file called sphinx.conf in /etc/sphinxsearch. The configuration consists of 3 main blocks that are essential to run: index, searchd, and source. We’ll provide an example configuration file for you to use, and explain each section so you can customize it later.

First, create the sphinx.conf file.

sudo nano /etc/sphinxsearch/sphinx.conf

The source, index, and searchd blocks of an example sphinx.conf are shown below; copy them into the file and adjust the values to your setup.

source src1
{
  type			= mysql

  sql_host		= localhost
  sql_user		= root
  sql_pass		= your_root_mysql_password
  sql_db		= test
  sql_port		= 3306

  sql_query		= \
  SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
  FROM documents

  sql_attr_uint			= group_id
  sql_attr_timestamp	= date_added
}
index test1
{
  source			= src1
  path				= /var/lib/sphinxsearch/data/test1
  docinfo			= extern
}
searchd
{
  listen			= 9306:mysql41
  log				= /var/log/sphinxsearch/searchd.log
  query_log			= /var/log/sphinxsearch/query.log
  read_timeout		= 5
  max_children		= 30
  pid_file			= /var/run/sphinxsearch/searchd.pid
  seamless_rotate	= 1
  preopen_indexes	= 1
  unlink_old		= 1
  binlog_path		= /var/lib/sphinxsearch/data
}

To explore more configurations, you can take a look at the /etc/sphinxsearch/sphinx.conf.sample file, which has all the variables explained in even more detail.
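
With a configuration like the one above in place, the disk index can be built with `indexer` and then queried over SphinxQL on the port set by `listen`. The sketch below assumes the `mysql` command-line client is installed, that searchd is running, and that the `documents` table referenced in `sql_query` exists in the `test` database. The helper names are invented for this example, and the search terms are inserted into the query unescaped, so it is suitable for trusted input only.

```shell
#!/usr/bin/env bash
# Compose a SphinxQL full-text query for one index.
# The terms are inserted as-is (no escaping): trusted input only.
sphinxql_match() {
  local index="$1"
  local terms="$2"
  printf "SELECT * FROM %s WHERE MATCH('%s');\n" "$index" "$terms"
}

# Build (or rebuild) every disk index defined in sphinx.conf.
build_indexes() {
  sudo indexer --all
}

# Send a query to searchd through its MySQL-protocol listener
# (port 9306, as set by "listen" in the searchd block).
run_sphinxql() {
  mysql -h 127.0.0.1 -P 9306 -e "$1"
}

# Usage (requires Sphinx, a running searchd and the mysql client):
#   build_indexes
#   run_sphinxql "$(sphinxql_match test1 'test document')"
```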

2.5. Installing Sphinx on Windows

Installing Sphinx on a Windows server is often easier than installing on a Linux environment; unless you are preparing code patches, you can use the pre-compiled binary files from the Downloads area on the website.

  1. Extract everything from the .zip file you have downloaded: sphinx-2.2.11-dev-win32.zip, or sphinx-2.2.11-dev-win32-pgsql.zip if you need PostgreSQL support as well. (We are using version 2.2.11-dev here for the sake of example only; be sure to change this to the specific version you are using.) You can use Windows Explorer in Windows XP and up to extract the files, or a freeware package like 7Zip to open the archive.
     For the remainder of this guide, we will assume that the folders are unzipped into C:\Sphinx, such that searchd.exe can be found in C:\Sphinx\bin\searchd.exe. If you decide to use a different location for the folders or configuration file, please change the paths accordingly.
  2. Edit the contents of sphinx.conf.in, specifically the entries relating to @CONFDIR@, to paths suitable for your system.
  3. Install the searchd system as a Windows service:
     C:\Sphinx\bin> C:\Sphinx\bin\searchd --install --config C:\Sphinx\sphinx.conf.in --servicename SphinxSearch
  4. The searchd service will now be listed in the Services panel within the Management Console, available from Administrative Tools. It will not have been started, as you will need to configure it and build your indexes with indexer before starting the service. A guide to do this can be found under Quick tour.
     During the next steps of the install (which involve running indexer pretty much as you would on Linux) you may get an error saying that libmysql.dll cannot be found. If you have MySQL installed, you should find a copy of this library in your Windows directory, or sometimes in Windows\System32, or failing that in the MySQL core directories. If you do receive this error, copy libmysql.dll into the bin directory.

TEI Boilerplate (displaying TEI with CSS and JavaScript)

https://github.com/TEI-Boilerplate/TEI-Boilerplate

 

# About TEI Boilerplate

TEI Boilerplate (http://teiboilerplate.org/) is a lightweight solution for publishing styled TEI (Text Encoding Initiative) P5 content directly in modern browsers. With TEI Boilerplate, TEI XML files can be served directly to the web without server-side processing or translation to HTML. Our TEI Boilerplate Demo illustrates many TEI features rendered by TEI Boilerplate.

# Browser Compatibility

TEI Boilerplate requires a robust, modern browser to do its work. It is compatible with current versions of Firefox, Chrome, Safari, and Internet Explorer.

Note: For security reasons, some browsers (e.g., Chrome) will not process the XSLT transformation when the TEI document is opened from the local file system. Chrome does work fine when the TEI files are delivered through a Web server, including localhost.

If you have problems with TEI Boilerplate with a modern browser, please let us know by filing a bug report or feature request at http://github.com/GrantLS/TEI-Boilerplate/issues.

# Introduction

TEI is an XML-based language for describing and analyzing literary texts and other documents of interest to humanities scholars. Although TEI provides mechanisms for describing the design, presentational, and material features of the source document, projects and individual scholars that use TEI are responsible for developing their own methods, or implementing existing solutions, for converting the TEI to a presentation-ready state for the web or print (Rahtz, 2012). Two potential paths to reach this goal are:

  1. Transforming TEI to HTML using XSLT and styling the HTML output with CSS.
  2. Styling the TEI directly with CSS by referencing a CSS stylesheet from within the TEI document.

Both of these approaches have advantages and disadvantages. Although HTML is the language of the web and, as such, is well supported by browsers, HTML's descriptive capabilities are much less expressive than TEI's. When TEI is transformed to HTML, much of the richness of the TEI is lost or obscured in the resulting HTML. However, the browser understands HTML very well and knows, for example, when to initiate retrieval of a document based on certain user events, such as clicking a link. The second option, CSS-styled TEI, delivers the TEI document directly to the browser. However, while the browser may apply CSS to format and style a TEI document, the browser does not understand the semantics of TEI. For instance, the browser does not understand that TEI's <ptr> and <ref> elements are linking elements.

TEI Boilerplate bridges the gap between these two approaches by making use of the built-in XSLT (1.0) capabilities of browsers to embed the TEI XML, with minimal modifications, within an HTML5 shell document. Features expected of web documents, such as clickable links and display of linked images, are enabled through selective transformation of a very small number of TEI elements and attributes. Both the HTML5 shell and the embedded TEI are styled using CSS.

TEI Boilerplate is not intended to be a replacement for the many excellent XSLT solutions for publishing and displaying TEI/XML on the web. It is intended to be a simple and lightweight alternative to more complex XSLT solutions. There are both practical and theoretical advantages to this lightweight approach.

# Using it in Your Project

Download the TEI Boilerplate files, and host the dist directory on a web server.

The simplest way to use TEI Boilerplate (TEIBP) is simply to add your TEI files to the dist/content directory of TEI Boilerplate and include the following xml-stylesheet processing instruction at the top of your TEI documents, after the XML declaration and before the root <TEI> element:

<?xml-stylesheet type="text/xsl" href="teibp.xsl"?>

You may then access your TEI files from a modern browser and see the resulting styled document.
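
To make the placement of the processing instruction concrete, here is a minimal sketch that writes a small TEI file into the content directory. The file name `sample.xml`, the function name `write_sample_tei`, and the document contents are placeholders chosen for this example; only the processing instruction itself is taken from the text above.

```shell
#!/usr/bin/env bash
# Write a minimal TEI document that points at teibp.xsl.
# The file name sample.xml and its contents are placeholders.
write_sample_tei() {
  local dir="${1:-dist/content}"
  mkdir -p "$dir"
  cat > "$dir/sample.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="teibp.xsl"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample document</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Born digital.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Hello, TEI Boilerplate.</p>
    </body>
  </text>
</TEI>
EOF
}

# Usage, from the TEI Boilerplate directory:
#   write_sample_tei
```

The processing instruction sits on the second line, after the XML declaration and before the root <TEI> element. As noted under Browser Compatibility, open the file through a web server rather than from the local file system.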

TEI Critical Apparatus Toolbox

This page offers a graphical user interface for customising an XSLT transformation from TEI XML to LaTeX (reledmac) and PDF. Although we tried to offer as generic a transformation as possible, further customisation might be necessary for your own flavour of encoding. You can perform such advanced customisation at the bottom of this form, provided you know XSLT and LaTeX.

http://teicat.huma-num.fr/print.php

The TEI Critical Apparatus Toolbox is a tool for people preparing a natively digital TEI critical edition.

The Toolbox lets you

  • Check your encoding: offers facilities to display your edition while it is still in the making, and check the consistency of your encoding
  • Display parallel versions: choose the sigla of the witnesses, and the different versions of the text, following each chosen witness, will be displayed in parallel columns.
  • Print a TEI XML edition, with a TEI-to-LaTeX and PDF transformation
  • Annotate an image: lets you easily trace zones on an image to prepare a documentary edition
  • Get statistics on the XML tags effectively used in different parts of your edition, and some word count.

TEI Lex-0 — A baseline encoding for lexicographic data

TEI Lex-0 is both a technical specification and a set of community-based recommendations for encoding machine-readable dictionaries. It is rooted in the Guidelines of the Text Encoding Initiative (TEI) and delivered as a customization of the TEI schema.

Following the spirit of TEI Analytics, developed in the context of the MONK project (Zillig 2009), TEI Lex-0 aims at establishing a baseline encoding and a target format to facilitate the interoperability of heterogeneously encoded lexical resources. This is important both in the context of building lexical infrastructures as such (Ermolaev and Tasovac 2012) and in the context of developing generic TEI-aware tools such as dictionary viewers and profilers.

https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html

TEI ODD Element List


Element list for ODD

This is a brief reference sheet listing the most essential elements used in writing
a TEI ODD file.

Ordered by Function

High-level ODD Structures

schemaSpec
The element that contains the formal schema specifications within the ODD file. All
the schema customization elements listed below are children of schemaSpec.
ident
Used on schemaSpec; indicates the name of the schema that will be created. Required by Roma.
moduleRef
A reference to a TEI module (i.e. a grouping of elements representing a single chapter
of the TEI Guidelines), by which the module is included in the custom schema.
key
Used on moduleRef, gives the short name of the module to be included
except
Used on moduleRef, lists which elements from the module should not be included
include
Used on moduleRef, lists which elements from the module should be included (thus excluding all others)
mode
Used on almost all ODD elements, indicates the nature of the change being made in
relation to unmodified TEI:
  • add indicates that the feature in question originates with the customization and is additional to the unmodified TEI schema
  • delete indicates that the feature in question is being deleted from the custom schema
  • change indicates that the feature in question is being changed in some detail, and that
    the changed portion should override the corresponding portion of the unmodified TEI
    schema, leaving other portions unchanged
  • replace indicates that the feature in question is being completely changed, and that the
    portion designated in the custom schema should entirely replace the specification
    for that feature in the unmodified TEI
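
As an illustration of how these high-level structures fit together, the sketch below prints a small schemaSpec. The schema name `myTEI`, the function name, and the choice of modules and elements are made up for this example and are not a recommended customization.

```shell
#!/usr/bin/env bash
# Print a small, illustrative schemaSpec. The schema name "myTEI" and
# the selection of modules and elements are invented for this example.
sample_schema_spec() {
  cat <<'EOF'
<schemaSpec ident="myTEI" start="TEI">
  <!-- include whole modules -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <!-- include a module, leaving out two of its elements -->
  <moduleRef key="core" except="binaryObject media"/>
  <!-- include only the listed elements of a module -->
  <moduleRef key="textstructure" include="TEI text body div"/>
  <!-- delete one more element from an included module -->
  <elementSpec ident="said" module="core" mode="delete"/>
</schemaSpec>
EOF
}

# Usage: paste the output into the body of an ODD file, e.g.
#   sample_schema_spec > schemaSpec-fragment.xml
```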

Element Management

elementSpec
A specification of a single element in the schema, which can be used to delete an
element from an included module (with mode="delete") or to modify some aspect of the element's definition (with mode="change" or mode="replace").
module
Used with elementSpec, indicates the module in which the element in question is defined.
ident
Used with elementSpec, indicates the name of the element in question
altIdent
Used inside elementSpec to change the name of an element; indicates the new name by which the element being
specified will be known in the custom schema
classes
Used within elementSpec (or classSpec) to change the class membership of the element (or class). Contains one or more memberOf elements which, by means of their key attributes, indicate the specific classes to which the element (or class) is being
added or from which it is being deleted.
memberOf
Designates a specific model class to which an element is being added, or from which
it is being deleted.
key
Used on memberOf, contains the name of the class being designated.
mode
Used on memberOf, with value add, indicates that the element is being added to the class designated by the memberOf element.
content
Contains the content model for an element in RELAX NG (XML syntax) or Pure ODD elements,
for which see Content, below
constraintSpec
Contains a further syntactic constraint, typically expressed in ISO Schematron.

Content

elementRef
A reference to an element (typically a TEI element, but perhaps one you’ve added in
your customization), indicating that it is permitted or required at this point
key
used on elementRef, indicates the element being referred to.
minOccurs
indicates the smallest number of times the indicated element may occur at this point
(default is 1).
maxOccurs
indicates the largest number of times the indicated element may occur at this point
(default is unbounded, i.e. an infinite number of times).
classRef
A reference to an entire model class, all of whose members may (or must) occur at
this point
key
used on classRef, indicates the model class being referred to.
include
used on classRef, indicates which members of the model class are being referred to
except
used on classRef, indicates which members of the model class are not being referred to
textNode
Used to indicate that a string of zero or more characters is allowed at this point.
sequence
Indicates that each of the items referred to by the references inside must occur;
if preserveOrder="true" then furthermore they must occur in the order specified
alternate
Indicates that only one of the items referred to by the references inside must occur
(the number of times a selected alternate must occur is specified by minOccurs and maxOccurs)
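
A content model built from these elements might look like the following sketch. The element being changed (`div`) and the model itself are invented for illustration. Both occurrence attributes are written out explicitly so that the example does not depend on their defaults.

```shell
#!/usr/bin/env bash
# Print an illustrative content model: an optional head followed by one
# or more paragraph-like elements, in that order.
sample_content_model() {
  cat <<'EOF'
<elementSpec ident="div" module="textstructure" mode="change">
  <content>
    <sequence preserveOrder="true">
      <!-- an optional heading, at most one -->
      <elementRef key="head" minOccurs="0" maxOccurs="1"/>
      <!-- then one or more members of the paragraph-like class -->
      <classRef key="model.pLike" minOccurs="1" maxOccurs="unbounded"/>
    </sequence>
  </content>
</elementSpec>
EOF
}
```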

Attribute Management

attList
A list of attribute definitions; used to contain the attDef elements that are used to add, delete, or modify the attributes for an element.
attDef
Used to designate an attribute that is being added, deleted, or modified.
ident
Used on attDef, indicates the name of the attribute in question.
valList
A list of values for an attribute (used to add or alter a set of constrained values).
valItem
A single value in a value list for an attribute.
datatype
Contains a reference to a TEI datatype, used when constraining the contents of an
element or attribute value.
constraintSpec
Contains a further syntactic constraint, typically expressed in ISO Schematron.
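
The attribute elements can be combined as in the sketch below, which restricts one attribute to a closed list of values and deletes another. The element, the attribute names and the values are chosen only for illustration.

```shell
#!/usr/bin/env bash
# Print an illustrative attList: constrain @type on div to two values
# and remove @subtype.
sample_att_list() {
  cat <<'EOF'
<elementSpec ident="div" module="textstructure" mode="change">
  <attList>
    <!-- restrict @type to a closed list of two values -->
    <attDef ident="type" mode="change">
      <valList type="closed" mode="replace">
        <valItem ident="chapter"/>
        <valItem ident="section"/>
      </valList>
    </attDef>
    <!-- drop an attribute altogether -->
    <attDef ident="subtype" mode="delete"/>
  </attList>
</elementSpec>
EOF
}
```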

Schema Documentation and Management

gloss
Used inside elementSpec, attDef, and valItem to provide the full natural-language equivalent of the name (i.e. ident) of the construct. E.g., the gloss for elementSpec ident="att" is attribute (as opposed to attention)
desc
Used inside elementSpec, attDef, and valItem to document the new or altered element, attribute, or attribute value.
exemplum
Contains an example of usage of the attribute (in attDef), class (in classSpec), element (in elementSpec), or macro (in macroSpec) along with optional commentary
remarks
Contains commentary or discussion about usage of the attribute (in attDef), class (in classSpec), element (in elementSpec), or macro (in macroSpec)
classSpec
A specification of a TEI class (either a model class or an attribute class); often
used (with a mode of change) to change the membership of an attribute class
macroSpec
A specification of a TEI datatype or a chunk of arbitrary RELAX NG code

Processing Model Documentation

model
Documents potential processing for a specified element
behaviour
Names the process or function which this processing model uses in order to produce output
predicate
The XPath predicate expression giving the condition under which this model applies
useSourceRendition
Whether to obey any rendition attribute which is present (by default no)
outputRendition
Description of the rendering of this element (in selected context)
modelGrp
A grouping of model elements with common output
modelSequence
A sequence of model elements intended as a single set of actions

Copyleft 2010 Syd Bauman and Julia
Flanders; source available at
http://www.wwp.neu.edu/outreach/seminars/_current/handouts/elementList_odd.tei.

TEI Publisher workshop

Wolfgang Meier offers an online workshop titled "Stay home and Learn TEI Publisher From Scratch". The materials are available at https://github.com/eeditiones/workshop .

The slides of the first sessions are available here:

Videos of the first session:

Videos of the second session

Videos of the third session