SBOM (Software Bill Of Materials) for Docker

https://docs.docker.com/engine/sbom/

A Software Bill Of Materials (SBOM) is analogous to a packing list for a shipment. It lists all the components that make up the software, or were used to build it. For container images, this includes the operating system packages that are installed (for example, ca-certificates) along with language-specific packages that the software depends on (for example, Log4j). The SBOM could include a subset of this information or even more details, like the versions of components and their source.

It is available as a plugin bundled with the latest Docker Desktop releases.
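
With the plugin installed, generating an SBOM for an image is a single command; for example (the image name here is just an illustration):

```
# List the packages discovered in the given image
docker sbom alpine:latest
```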

Solr Cell & schemaless indexing with Apache Tika

Key Solr Cell Concepts

When using the Solr Cell framework, it is helpful to keep the following in mind:

  • Tika will automatically attempt to determine the input document type (e.g., Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the stream.type parameter. See http://tika.apache.org/1.24.1/formats.html for the file types supported.
  • Briefly, Tika internally works by synthesizing an XHTML document from the core content of the parsed document which is passed to a configured SAX ContentHandler provided by Solr Cell. Solr responds to Tika’s SAX events to create one or more text fields from the content. Tika exposes document metadata as well (apart from the XHTML).
  • Tika produces metadata such as Title, Subject, and Author according to specifications such as the DublinCore. The metadata available is highly dependent on the file types and what they in turn contain. Some of the general metadata created is described in the section Metadata Created by Tika below. Solr Cell supplies some metadata of its own too.
  • Solr Cell concatenates text from the internal XHTML into a content field. You can configure which elements should be included/ignored, and which should map to another field.
  • Solr Cell maps each piece of metadata onto a field. By default it maps to the same name but several parameters control how this is done.
  • When Solr Cell finishes creating the internal SolrInputDocument, the rest of the Lucene/Solr indexing stack takes over. The next step after any update handler is the Update Request Processor chain.

Solr Cell is a contrib, which means it’s not automatically included with Solr but must be configured. The example configsets have Solr Cell configured, but if you are not using those, you will want to pay attention to the section Configuring the ExtractingRequestHandler in solrconfig.xml below.
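
To make the extraction flow concrete, here is a request to the ExtractingRequestHandler in the spirit of the reference-guide examples linked below; the core name (techproducts), document path, and literal.id value are placeholders:

```
# Send a PDF through Solr Cell and index the extracted content
curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' \
  -F "myfile=@example/exampledocs/solr-word.pdf"
```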

For more information see:

https://solr.apache.org/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html

SolrWayback

GitHub repo: https://github.com/netarchivesuite/solrwayback

About SolrWayback

SolrWayback is a web application for browsing historically harvested ARC/WARC files, similar to the Internet Archive Wayback Machine. SolrWayback runs against a Solr server containing ARC/WARC files indexed using the warc-indexer.

SolrWayback bundle release 4.4.2 can be downloaded here: https://github.com/netarchivesuite/solrwayback/releases/tag/4.4.2

The bundle is the recommended way to get started with SolrWayback: download it, follow the installation guide, and index your own WARC files, and you are up and running.

## Installation instructions from the README file, once the SolrWayback bundle has been downloaded

### 1) INITIAL SETUP
Properties:
Copy the two files `properties/solrwayback.properties` and `properties/solrwaybackweb.properties` to your HOME folder (or the home folder of the Tomcat user).
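
On Linux/macOS this copy amounts to (run from the unpacked bundle folder):

```
cp properties/solrwayback.properties properties/solrwaybackweb.properties ~/
```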

Optional: for screenshot previews to work, you may have to edit `solrwayback.properties` and change the values of the last two properties: `chrome.command` and `screenshot.temp.imagedir`.
Chrome (or Chromium) must be installed for screenshot preview images.

If there are errors when running a script, try changing the permissions for the file (`startup.sh` etc.). Linux: `chmod +x filename.sh`

### 2) STARTING SOLRWAYBACK
SolrWayback requires both Solr and Tomcat to be running.

#### Tomcat:

* Start tomcat: `apache-tomcat-8.5.60/bin/startup.sh`
* Stop tomcat: `apache-tomcat-8.5.60/bin/shutdown.sh`
* (For windows navigate to `apache-tomcat-8.5.60/bin/` and type `startup.bat` or `shutdown.bat`)
* To verify that Tomcat is running, open: http://localhost:8080/solrwayback/

#### Solr:
* Start solr: `solr-7.7.3/bin/solr start`
* Stop solr: `solr-7.7.3/bin/solr stop -all`
* (For windows navigate to `solr-7.7.3/bin/` and type `solr.cmd start` or `solr.cmd stop -all`)
* To verify that Solr is running, open: http://localhost:8983/solr/#/netarchivebuilder

### 3) INDEXING
SolrWayback uses a Solr index of WARC files to support freetext search and more complex queries.
If you do not have existing WARC files, see steps below on harvesting with wget.

The script `warc-indexer.sh` in the `indexing` folder supports multiple parallel processes and keeps track of already indexed files, so the collection can be extended by adding more WARCs and running the script again.

Call `indexing/warc-indexer.sh -h` for usage and how to adjust the number of processes to use for indexing.
Example usage that indexes all WARC files in the `warcs1` folder:
```
THREADS=2 ./warc-indexer.sh warcs1/*
```

This will start indexing files from the `warcs1` folder using 2 threads. Assigning more threads than there are CPU cores will result in slower indexing. Each indexing job requires 1 GB of RAM, so memory can also be a limiting factor.

You can also populate the collection and collection ID fields in Solr with custom values:
```
THREADS=4 INDEXER_CUSTOM="--collection_id collection1 --collection corona2021" ./warc-indexer.sh warcs1/*
```

You can then enable faceting on these fields in `solrwaybackweb.properties`.

The script keeps track of processed files by checking if a log from a previous analysis is available. The logs are stored
in the `status`-folder (this can be changed using the `STATUS_ROOT` variable). To re-index a WARC file, delete the
corresponding log file.
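
For example, a minimal sketch of re-indexing a single WARC file; the file names here are hypothetical, and the exact log naming on your installation may differ:

```
# Hypothetical names: check the status folder for the actual log file
rm status/example.warc.gz.log
THREADS=2 ./warc-indexer.sh warcs1/example.warc.gz
```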

The script `warc-indexer.sh` is not available for Windows. For Windows, a more primitive script is provided, which also works on Linux/macOS:
1. Copy ARC/WARC files into the folder `indexing/warcs1`
2. Start indexing: call `indexing/batch_warcs1_folder.sh` (or `batch_warcs1_folder.bat` for Windows)

Indexing can take up to 20 minutes per 1 GB of WARC files. After indexing, the WARC files must stay in the same folder, since SolrWayback uses them during playback and related features.

Whitespace characters in WARC file names can prevent page previews and playback from working on some systems.
There can be a delay of up to 5 minutes before newly indexed files are visible in search. Visit this URL after the index job has finished to commit them instantly: http://localhost:8983/solr/netarchivebuilder/update?commit=true
A similar script, `batch_warcs2_folder.sh`, shows how to add new WARC files to the collection without indexing the old ones again.
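
The same instant commit can be triggered from the command line; this is simply the URL above issued with curl:

```
curl 'http://localhost:8983/solr/netarchivebuilder/update?commit=true'
```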

For more information about the warc-indexer see: https://github.com/ukwa/webarchive-discovery/wiki/Quick-Start

Specification for an OAI Static Repository and an OAI Static Repository Gateway

http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm

A Static Repository, introduced here, provides a simple approach for exposing relatively static and small collections of metadata records through the OAI-PMH. The Static Repository approach is targeted at organizations that:

  • Have metadata collections ranging in size between 1 and 5000 records;
  • Can make static content available through a network-accessible Web server;
  • Need a technically simpler implementation strategy compared to acting as an OAI-PMH Repository, which requires processing OAI-PMH requests.

A Static Repository is an XML file that is made accessible at a persistent HTTP URL. The XML file contains metadata records and repository information.

A Static Repository becomes accessible via OAI-PMH through the intermediation of one Static Repository Gateway. The restriction that only one Static Repository Gateway acts as an intermediary for each Static Repository reduces potential problems with large-scale duplication of metadata records among OAI-PMH repositories. A Static Repository Gateway uses the metadata records and repository information, provided via XML in the Static Repository, to respond to the six OAI-PMH requests for access to that information. Because a Static Repository Gateway maps a unique Static Repository base URL to each such Static Repository, harvesters can access a Static Repository in exactly the same manner as they access any other OAI-PMH Repository.
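
As a sketch of what this looks like in practice, using hypothetical host names in the style of the specification's examples (the gateway base URL is formed by appending the Static Repository URL, minus its http:// prefix, to the gateway's own URL):

```
# The Static Repository itself is just an XML file on a web server
curl http://an.oai.org/ma/mini.xml

# Through its gateway it answers regular OAI-PMH requests
curl 'http://gateway.institution.org/oai/an.oai.org/ma/mini.xml?verb=Identify'
```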

The relationship between Static Repositories, a Static Repository Gateway, and an OAI-PMH harvester is illustrated in the figure below. The Static Repository and the Static Repository Gateway are described in the remainder of this document. Implementers whose sole interest is the creation of a Static Repository may skip Section 4 that describes the Static Repository Gateway.

SphinxSE MySQL Full Text Search

SphinxSE Full Text Search engine:

http://sphinxsearch.com/docs/current.html#about

Sphinx is a full-text search engine, publicly distributed under GPL version 2. Commercial licensing (e.g. for embedded use) is available upon request.

Technically, Sphinx is a standalone software package that provides fast and relevant full-text search functionality to client applications. It was specially designed to integrate well with SQL databases storing the data, and to be easily accessed by scripting languages. However, Sphinx does not depend on, nor require, any specific database to function.

Applications can access the Sphinx search daemon (searchd) using any of three different access methods: a) via Sphinx's own implementation of the MySQL network protocol (using a small SQL subset called SphinxQL; this is the recommended way), b) via the native search API (SphinxAPI), or c) via a MySQL server with a pluggable storage engine (SphinxSE).
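
Because searchd speaks the MySQL wire protocol, the stock mysql client can issue SphinxQL directly; a minimal sketch, assuming searchd listens on the default SphinxQL port 9306 and an index named test1 exists (as in the configuration shown later):

```
# No MySQL server involved: the mysql client talks straight to searchd
mysql -h 127.0.0.1 -P 9306 -e "SELECT * FROM test1 WHERE MATCH('my search terms');"
```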

Official native SphinxAPI implementations for PHP, Perl, Python, Ruby, and Java are included in the distribution package. The API is very lightweight, so porting it to a new language is known to take a few hours or days. Third-party API ports and plugins exist for Perl, C#, Haskell, Ruby-on-Rails, and possibly other languages and frameworks.

Starting from version 1.10-beta, Sphinx supports two different indexing backends: the "disk" index backend and the "realtime" (RT) index backend. Disk indexes support online full-text index rebuilds, but online updates can only be done on non-text (attribute) data. RT indexes additionally allow for online full-text index updates. Previous versions only supported disk indexes.

Data can be loaded into disk indexes using a so-called data source. Built-in sources can fetch data directly from MySQL, PostgreSQL, MSSQL, ODBC-compliant databases (Oracle, etc.), or from a pipe in TSV or a custom XML format. Adding new data source drivers (e.g. to natively support other DBMSes) is designed to be as easy as possible. RT indexes, as of 1.10-beta, can only be populated using SphinxQL.

http://www.sphinxsearch.com/wiki/doku.php?id=tutorials

2.3. Installing Sphinx packages on Debian and Ubuntu

There are two ways of getting Sphinx for Ubuntu: regular deb packages and the Launchpad PPA repository.

Deb packages:

  1. Sphinx requires a few libraries to be installed on Debian/Ubuntu. Use apt-get to download and install these dependencies:
     $ sudo apt-get install mysql-client unixodbc libpq5
  2. Now you can install Sphinx:
     $ sudo dpkg -i sphinxsearch_2.2.11-dev-0ubuntu12~trusty_amd64.deb

PPA repository (Ubuntu only).

Installing Sphinx from the Sphinxsearch PPA repository is much easier, because you will get all the dependencies and can also update Sphinx to the latest version with the same command.

  1. First, add the Sphinxsearch repository and update the list of packages:
     $ sudo add-apt-repository ppa:builds/sphinxsearch-rel22
     $ sudo apt-get update
  2. Install/update the sphinxsearch package:
     $ sudo apt-get install sphinxsearch

The Sphinx searchd daemon can be started and stopped using the service command:

$ sudo service sphinxsearch start

For a complete Tutorial see: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-sphinx-on-ubuntu-16-04

Prerequisites

Before you begin this guide, you will need:

  • One Ubuntu 16.04 server.
  • A sudo non-root user, which you can set up by following this tutorial.
  • MySQL installed on your server, which you can set up by following the step 2 of this tutorial.

Step 1 — Installing Sphinx

Installing Sphinx on Ubuntu is easy because it’s in the native package repository. Install it using apt-get.

sudo apt-get install sphinxsearch

Now you have successfully installed Sphinx on your server. Before starting the Sphinx daemon, let’s configure it.

Sphinx’s configuration should be in a file called sphinx.conf in /etc/sphinxsearch. The configuration consists of 3 main blocks that are essential to run: index, searchd, and source. We’ll provide an example configuration file for you to use, and explain each section so you can customize it later.

First, create the sphinx.conf file.

  1. sudo nano /etc/sphinxsearch/sphinx.conf

Each of the index, searchd, and source blocks is described below. Then, at the end of this step, the entirety of sphinx.conf is included for you to copy and paste into the file.

```
source src1
{
  type          = mysql

  sql_host      = localhost
  sql_user      = root
  sql_pass      = your_root_mysql_password
  sql_db        = test
  sql_port      = 3306

  sql_query     = \
  SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
  FROM documents

  sql_attr_uint      = group_id
  sql_attr_timestamp = date_added
}

index test1
{
  source  = src1
  path    = /var/lib/sphinxsearch/data/test1
  docinfo = extern
}

searchd
{
  listen          = 9306:mysql41
  log             = /var/log/sphinxsearch/searchd.log
  query_log       = /var/log/sphinxsearch/query.log
  read_timeout    = 5
  max_children    = 30
  pid_file        = /var/run/sphinxsearch/searchd.pid
  seamless_rotate = 1
  preopen_indexes = 1
  unlink_old      = 1
  binlog_path     = /var/lib/sphinxsearch/data
}
```

To explore more configurations, you can take a look at the /etc/sphinxsearch/sphinx.conf.sample file, which has all the variables explained in even more detail.
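
Before searchd can answer queries, the index has to be built with the indexer tool. A minimal sketch, assuming the test database and documents table referenced by sql_query above already exist:

```
# Build all indexes defined in /etc/sphinxsearch/sphinx.conf
sudo indexer --all

# Start the daemon using the service name from earlier
sudo service sphinxsearch start
```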

2.5. Installing Sphinx on Windows

Installing Sphinx on a Windows server is often easier than installing on a Linux environment; unless you are preparing code patches, you can use the pre-compiled binary files from the Downloads area on the website.

  1. Extract everything from the .zip file you have downloaded – sphinx-2.2.11-dev-win32.zip, or sphinx-2.2.11-dev-win32-pgsql.zip if you need PostgreSQL support as well. (We are using version 2.2.11-dev here for the sake of example only; be sure to change this to the specific version you're using.) You can use Windows Explorer in Windows XP and up to extract the files, or a freeware package like 7Zip to open the archive. For the remainder of this guide, we will assume that the folders are unzipped into C:\Sphinx, such that searchd.exe can be found in C:\Sphinx\bin\searchd.exe. If you decide to use a different location for the folders or configuration file, please change it accordingly.
  2. Edit the contents of sphinx.conf.in – specifically entries relating to @CONFDIR@ – to paths suitable for your system.
  3. Install the searchd system as a Windows service:
     C:\Sphinx\bin> C:\Sphinx\bin\searchd --install --config C:\Sphinx\sphinx.conf.in --servicename SphinxSearch
  4. The searchd service will now be listed in the Services panel within the Management Console, available from Administrative Tools. It will not have been started, as you will need to configure it and build your indexes with indexer before starting the service. A guide to doing this can be found under Quick tour. During the next steps of the install (which involve running indexer pretty much as you would on Linux) you may find that you get an error relating to libmysql.dll not being found. If you have MySQL installed, you should find a copy of this library in your Windows directory, or sometimes in Windows\System32, or failing that in the MySQL core directories. If you do receive an error, please copy libmysql.dll into the bin directory.
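
As a sketch of those next steps on Windows, assuming your edited configuration has been saved as C:\Sphinx\sphinx.conf (adjust paths to match your install):

```
REM Build all indexes defined in the configuration file
C:\Sphinx\bin\indexer --config C:\Sphinx\sphinx.conf --all

REM Start the service installed above
net start SphinxSearch
```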