Leipzig Corpus Miner (LCM)
The iLCM is not a stand-alone software product but an infrastructure consisting of a multitude of components, including a document database (MariaDB), an NLP pipeline for running different text mining processes (in the R statistical language), a full-text index (Solr) and a web application (R Shiny). To make the infrastructure available as a decentralized installation for other projects, it is embedded in a virtual machine ensemble (Docker), which can easily be set up with predefined configuration scripts. The application is therefore a fusion of R scripting capabilities, data management and visualization via R Shiny. By using an ORC approach, the data processing is documented on the fly: the data, the scripts used and their description are always available in a “notebook” for later review.
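As a rough illustration of such a container ensemble (this is only a sketch, not the actual iLCM configuration scripts; service names, images and ports are assumptions), a docker-compose file combining the three backend components could look like this:
services:
  mariadb:                  # document database
    image: mariadb
    environment:
      MYSQL_ROOT_PASSWORD: your_root_password
    volumes:
      - db_data:/var/lib/mysql
  solr:                     # full-text index (default port 8983)
    image: solr
    ports:
      - "8983:8983"
  shiny:                    # R Shiny web application / R NLP pipeline
    image: rocker/shiny     # assumption: a generic Shiny base image, not the iLCM image
    ports:
      - "3838:3838"
    depends_on:
      - mariadb
      - solr
volumes:
  db_data: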
LombardPress tools – framework to edit historical texts in TEI
https://lombardpress.org/applications-publications/
About LombardPress
LombardPress Schema
The LombardPress Schema is a narrow customization of the more general TEI specification. This customization allows us to document and encode the increasingly complex interaction of works, transcriptions, translations, and images that constitute the corpus of scholastic texts. For more, see Lbp Schema
LombardPress Web Publication Framework
The LombardPress-Web is a web application designed to allow users to read and study digitally encoded texts, particularly those edited according to the LombardPress Schema and ingested in the Scholastic Commentaries and Text Archive. For more see Lbp Web
LombardPress Print
LombardPress Print is a set of stylesheets, scripts, and workflows designed to allow for the easy typesetting of complexly encoded critical editions in LaTeX and ultimately the transformation of these texts into camera ready proofs. For more see Lbp Print
Pellego
Pellego is a simple text viewer designed to consume the SCTA simple presentation API. Pellego can function as a standalone app or be easily embedded within any website. View a demo here: http://lombardpress.org/pellego/ or download/fork/clone it on GitHub at https://github.com/lombardpress/pellego
Ad fontes
Ad fontes is a specialized app for exploring quotations throughout the scholastic corpus. The application is published at http://lombardpress.org/adfontes/
Contact
The primary contact for the LombardPress system is Jeffrey C Witt (jcwitt [at] loyola [dot] edu).
LZA – Long-term archiving (Langzeitarchivierung)
Standards for capturing metadata for long-term archiving:
Manticore Search (formerly Sphinx)
Manticore Search is an open-source database that was created in 2017 as a continuation of the Sphinx Search engine. We took all the best from it, significantly improved its functionality, fixed hundreds of bugs, rewrote the code almost completely and kept it open-source! All of that has made Manticore Search a modern, fast, lightweight and full-featured database with outstanding full-text search capabilities.
https://manticoresearch.com/about/
We love SQL. There can’t be anything simpler to use when you are just preparing your search query: WHERE, GROUP BY, ORDER BY – most developers are used to these things since they’ve been in use for decades. You can use SQL to do any kind of query in Manticore Search. At the same time we understand that when it comes to coding your queries in your application it may be easier to use more structured protocols than an SQL string. That’s why Manticore Search also speaks JSON. We also maintain Manticore Search bindings for various programming languages to make integration even easier.
Manticore Search, being initially a purely full-text search engine, has outstanding full-text capabilities: over 20 full-text operators and more than 20 ranking factors, various built-in rankers and an expression-based custom ranker, text stemming, lemmatization, stopwords, synonyms, wordforms, low-level character mapping, proper Chinese segmentation, easy text highlighting, ranking and tokenization plugins and much more.
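As a minimal sketch (the index name products and the example query are assumptions), the same full-text search can be run over the MySQL protocol (default port 9306) or via the JSON HTTP API (default port 9308):
# full-text query over the MySQL protocol; „products“ is an assumed index name
mysql -h 127.0.0.1 -P 9306 -e "SELECT id, WEIGHT() FROM products WHERE MATCH('\"pump sensor\" -broken') ORDER BY WEIGHT() DESC LIMIT 10;"
# the same search via the JSON HTTP API
curl -s -X POST http://127.0.0.1:9308/search -d '{"index":"products","query":{"match":{"*":"pump sensor"}},"limit":10}'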
MiMoText – Mining and Modeling Text – with WikiBase
See tutorial at: https://mimotext.github.io
MySQL
https://support.infrasightlabs.com/article/host-is-not-allowed-to-connect-to-this-mysql-server/
Access via port 3306 from outside the server:
Enable external MySQL access in my.cnf:
By default, the MySQL server only listens on the localhost IP address (127.0.0.1). The following entry in my.cnf (/etc/my.cnf or /etc/mysql/my.cnf) is responsible for this:
bind-address = 127.0.0.1
To be able to access the MySQL server from other machines as well, the „bind-address“ entry is changed. 0.0.0.0 tells the MySQL server to listen on all IP addresses available to it:
bind-address = 0.0.0.0
With „bind-address = 192.168.200.1“ the server can, for example, also be configured to be reachable only via one specific IP address.
Restart the MySQL server so that the configuration is applied:
/etc/init.d/mysql restart
Check the connection (with a telnet to the server’s IP address/hostname and the MySQL port 3306 you can verify that the MySQL server answers correctly):
telnet 192.168.200.1 3306
Allowing users external access to the MySQL database
The MySQL server now generally accepts connections from other (external) IP addresses, but the database users still have to be given the corresponding permission.
Connect to the MySQL console:
mysql -u root -p
The following commands give the user (dein_user) permission to access the database from any host (%):
use mysql;
update user set host='%' where user='dein_user';
update db set host='%' where user='dein_user';
-- apply the changed grant tables immediately (otherwise restart the MySQL server)
flush privileges;
Create a new MySQL user so that external access is possible (here „%“ is also used instead of „localhost“):
create user 'dein_user'@'%';
Allowing MySQL access for specific IP addresses (MySQL IP restriction) via iptables
MySQL access can also be allowed or blocked for specific IP addresses. For this we use the Linux firewall iptables. On Debian and most other Linux distributions, iptables is already preinstalled.
The following rule defines that localhost may still access the MySQL service:
iptables -A INPUT -i lo -p tcp --dport 3306 -j ACCEPT
The following rules allow access (port 3306) for the IP addresses 10.27.0.80 and 192.168.0.90 and block it for everyone else:
iptables -A INPUT -p tcp --dport 3306 -s 10.27.0.80 -j ACCEPT
iptables -A INPUT -p tcp --dport 3306 -s 192.168.0.90 -j ACCEPT
iptables -A INPUT -p tcp --dport 3306 -j REJECT --reject-with icmp-port-unreachable
If you are wondering why REJECT is used for the blocking rule instead of DROP: with REJECT the connection is refused immediately, whereas with DROP the connection is also blocked but the client has to wait for a timeout.
The iptables rules still have to be saved permanently, otherwise they will be removed by the system again after a reboot. See: saving iptables rules permanently
The easiest way to do this is via the package „iptables-persistent“:
apt-get install iptables-persistent
During installation you will be asked whether the existing rules should be written to the configuration files right away. This can be done immediately. Otherwise, the config files are located under /etc/iptables.
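If the rules are changed later, they can be written back to these config files; two common ways (assuming the iptables-persistent/netfilter-persistent package from above is installed) are:
# re-save the currently active rules so that they survive a reboot
netfilter-persistent save
# or write them to the configuration file directly
iptables-save > /etc/iptables/rules.v4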
Dockerize MySQL
A simple MySQL database can be set up in a Docker container. MySQL offers an official image that can be used for this.
A MySQL container can be created either using a Dockerfile or using docker-compose. A basic docker-compose file can look like this:
services:
  mysql_container:
    image: mysql
    container_name: mysql_container
    command: --default-authentication-plugin=mysql_native_password
    environment:
      MYSQL_ROOT_PASSWORD: your_root_password
    volumes:
      - data:/var/lib/mysql
volumes:
  data:
How does it work?
image: mysql = When creating a container, use the image „mysql“ from dockerhub
container_name: mysql_container = the name under which the container can be addressed once it is created
command:… and environment = sets up basic authentification. You can access the database using the user „root“ and the password given under „MYSQL_ROOT_PASSWORD“ to login once the container is started.
volumes: At the bottom, a volume called „data“ is created in you project folder. This volume is mounted onto the docker-container. Data in this folder will be persistent even if the container is stopped.
Source: https://hub.docker.com/_/mysql (Official MySQL image)
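With this compose file in place, the container can be started and tested roughly like this (container name and root user are the ones from the example above):
# start the container in the background (older installations use „docker-compose up -d“)
docker compose up -d
# open a MySQL shell inside the running container
docker exec -it mysql_container mysql -u root -p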
Importing csv data within a docker-container
To import CSV data inside a Docker container, there are three basic steps:
- Make sure csv-data has the right format
- Use docker cp to copy the data into the container
- Use LOAD DATA INFILE to load data into tables
The correct format
Since CSV data requires each line to be terminated by a newline character, we have to pay attention to the operating system on which the data was created: Windows uses a different line ending (\r\n) than Linux (\n). Since the MySQL Docker container is based on Linux, the data has to be formatted as if it had been created on a Linux machine.
The simplest way to achieve this is to open the data in Excel/LibreOffice/OnlyOffice or a similar application on a Linux machine and then save it as .csv from there.
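Alternatively, the line endings can be converted on the command line; a quick sketch, assuming the file is called data.csv:
# convert Windows line endings (\r\n) to Unix line endings (\n)
dos2unix data.csv
# or, if dos2unix is not installed:
sed -i 's/\r$//' data.csv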
Copying data into the docker-container
Once the data has the right format, we can copy it into a folder within the MySQL container. To do this, we first have to log in to the Docker container and create a folder where the files should go. With the setup above this could be the folder:
/var/lib/mysql/import_data
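For example (assuming the container name mysql_container from the compose file above):
# create the import folder inside the running container
docker exec mysql_container mkdir -p /var/lib/mysql/import_data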
Attention! To use this folder from MySQL you may have to add another option to the command in your docker-compose file, depending on your configuration:
- command: [...your commands...] --secure-file-priv=/var/lib/mysql/
Now you can use the docker cp command to copy the data from your system into the docker container:
[sudo] docker cp /path/on/your/machine mysql_container:/var/lib/mysql/import_data
Source: https://docs.docker.com/engine/reference/commandline/cp/
Load data in mysql
Now we have to load the data within MySQL. To do this, we first enter our container’s shell with
[sudo] docker exec -it mysql_container bash
Afterwards we can enter the mysql-shell with this command:
mysql -u username -p
It will ask you for your password. After that, use the database where the data should be loaded and then run this statement:
LOAD DATA INFILE '/var/lib/mysql/import_data/filename' INTO TABLE tablename FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;
Source: https://www.mysqltutorial.org/import-csv-file-mysql-table/ (Also further explains the statement)
Named Entity Recognition with the entity-fishing tool (Grobid NERD)
https://nerd.readthedocs.io/en/latest/overview.html
Researchers in the digital humanities and in the social sciences are often first of all interested in the identification and resolution of so-called named entities, e.g. person names, places, events, dates, organisations, etc. Entities can be known in advance and present in generalist or specialized knowledge bases. They can also be created based on open nomenclatures and vocabularies and be impossible to enumerate in advance.
The entity-fishing services try to automate this recognition and disambiguation task in a generic manner, avoiding as far as possible restrictions to particular domains, classes of entities or usages.
Tasks
entity-fishing performs the following tasks:
- entity recognition and disambiguation against Wikidata in a raw text or partially-annotated text segment,
- entity recognition and disambiguation against Wikidata at document level, for example for a PDF with layout positioning and structure-aware annotations,
- search query disambiguation (the short text mode), e.g. disambiguation of the search query “concrete pump sensor” in the service test console,
- weighted term vector disambiguation (a term being a phrase),
- interactive disambiguation in text editing mode.
Supervised machine learning is used for the disambiguation, based on Random Forest and Gradient Tree Boosting exploiting various features, including word and entity embeddings. Training is realized exploiting Wikipedia, which offers for each language a wealth of usage data about entity mentions in context. Results include in particular Wikidata identifiers and, optionally, statements.
The API also offers the possibility to apply filters based on Wikidata properties and values, which makes it possible to create specialised entity identification and extraction services (e.g. extracting only taxon entities or only medical entities in a document), relying on the 37M entities and 154M statements currently present in Wikidata.
The tool currently supports English, German, French, Spanish and Italian (more to come!). For English and French, a named entity recognizer based on CRF (grobid-ner) is used in combination with the disambiguation. For each recognized entity in one language, it is possible to complement the result with cross-lingual information in the other languages. An n-best mode is available. Domain information is produced for a large number of entities in the technical and scientific fields, together with Wikipedia categories and confidence scores.
The tool is developed in Java and has been designed for fast processing (at least for a NERD system): 500–1000 words per second on a medium-profile Linux server on a single thread, or one PDF page of a scientific article in 1–2 seconds, with limited memory (here 3 GB of RAM), while offering accuracy relatively close to the state of the art (more to come!). A search query can be disambiguated in 1–10 milliseconds. entity-fishing uses the very fast SMILE ML library for machine learning and a JNI integration of LMDB as an embedded database.
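As a rough sketch, a raw-text disambiguation query can be sent to the REST service with curl (port 8090 and the request format follow the defaults described in the entity-fishing documentation and may differ for your deployment):
# send a raw text to a local entity-fishing instance for entity recognition and disambiguation
curl -X POST http://localhost:8090/service/disambiguate \
  -F 'query={"text": "Leipzig is a city in Saxony, Germany.", "language": {"lang": "en"}}'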
The project can be found on github:
Neo4j
Neo4j is a graph database that can be queried using the query language Cypher. It can be used for many kinds of projects that require connected datasets or graph operations.
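A small Cypher example (the labels, properties and relationship type are made up for illustration) that creates two connected nodes and then queries the relationship:
// create two nodes and a relationship between them
CREATE (:Person {name: 'Ada'})-[:KNOWS]->(:Person {name: 'Bob'});
// find everyone Ada knows
MATCH (:Person {name: 'Ada'})-[:KNOWS]->(friend)
RETURN friend.name;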
Advantages
- Comes with a free community version that can do a lot of things
- Cypher is intuitive and easy to learn
- Comes with multiple options to view, manipulate and export graphs
- Comes with a desktop version for local usage
Disadvantages
- Community version does not include access management for single users
- Documentation is huge but not always helpful
Using neo4j with docker
neo4j offers an official docker image here: https://hub.docker.com/_/neo4j
With this image, neo4j can be configured in docker-compose:
services:
  neo4j:
    image: neo4j
    container_name: neo4j
    ports:
      - "7687:7687"
      - "7474:7474"
      - "7473:7473"
    volumes:
      - ../neo4j/data:/data
      - ../neo4j/logs:/logs
      - ../neo4j/import:/var/lib/neo4j/import
    environment:
      - NEO4J_dbms_connector_bolt_listen__address=0.0.0.0:7687
volumes:
  neo4j:
Explanation:
image: Uses the image called neo4j from Dockerhub
container_name: The container can be referenced by this name, which makes it easier to maintain
ports: neo4j uses three ports inside the container that need to be mapped to the host. On port 7687 other apps (e.g. a custom web app) can connect using the Bolt protocol. Port 7474 enables HTTP communication and also offers a graphical browser for the database. Port 7473 is for HTTPS connections.
volumes: within the container there are various folders, some of which we need to access in order to use the database. We also want to retain the data even if the container shuts down. We do this by declaring a volume called „neo4j“ at the bottom. On first execution this creates a volume called „neo4j“ on the local machine, which is then mounted into the Docker container so that the container can use the data from this volume. This is especially important for the import folder, since this is how data can be imported into neo4j (see the LOAD CSV sketch below).
environment: this is how neo4j is configured on setup. All variables from here are written to the conf/neo4j.conf file in the Docker container. A list of all possible configuration options can be found here. Normally, a parameter looks something like this: dbms.connector.bolt.listen_address. However, within docker-compose every . has to be replaced with _ and every underscore _ has to be typed as two underscores __. You also have to prefix the name with NEO4J_. The example thus becomes NEO4J_dbms_connector_bolt_listen__address. More information on this can be found here. Note that the settings given in the docker-compose file overwrite any default settings.
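As a sketch of how the import folder is used (file name, label and column names are assumptions): a CSV file copied into ../neo4j/import becomes available to Cypher via the file:/// URL scheme:
// load persons.csv from the mounted import folder and create one node per row
LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
CREATE (:Person {name: row.name, born: toInteger(row.born)});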
Remote connection to neo4j
When using neo4j as the database for a web app, it will most likely run on a server and require remote access. This does not work by default; some settings have to be made using the environment section in docker-compose:
NEO4J_dbms_connector_bolt_listen__address=0.0.0.0:7687
NEO4J_dbms_default__listen__address=0.0.0.0
NEO4J_dbms_default__advertised__address=0.0.0.0
NEO4J_dbms_connector_bolt_advertised__address=[Name of server!]
Please do not blindly copy and paste these settings, but do some research on what is needed for your project! However, this is what got ours to work.
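Once these settings are in place, a quick way to check the remote connection is cypher-shell, which ships with Neo4j (server name and credentials are placeholders):
# run a trivial query against the remote Bolt endpoint
cypher-shell -a neo4j://your.server.name:7687 -u neo4j -p your_password "RETURN 1;"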