Web Archiving

Below are the web crawlers tested so far in the search for one that can adequately crawl NADW sites for web archiving purposes.

Heritrix – https://heritrix.readthedocs.io/en/latest/getting-started.html#installation

Heritrix Documentation – Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. Heritrix seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations. https://heritrix.readthedocs.io/en/latest/


Archiveweb.page – https://archiveweb.page/

Archiveweb.page Documentation – ArchiveWeb.page is the latest tool from Webrecorder to turn your browser into a full-featured interactive web archiving system! ArchiveWeb.page is available as an extension for any Chrome or Chromium-based browser. (A standalone app version is also in development.) NB: Functions as a Chrome extension. It seems that once you launch it, you have to click through each link manually, and it records your progress as a WARC file.


WARCreate – https://warcreate.com/

WARCreate Documentation – WARCreate is a Google Chrome extension that allows a user to create a Web ARChive (WARC) file from any browseable webpage. The resulting files can then be used with other tools like the Internet Archive’s open source Wayback Machine. The tool is an evolving product with the end result pushing toward being a personal web archiving solution for those that wish to securely archive their metadata in a standardized way. NB: Functions as a Chrome extension. Once launched, it captures whatever webpage you are currently on as a WARC file.



grab-site Documentation – The archivist’s web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns. grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. NB: Only works with specific versions of Python: 3.7 or 3.8.



ArchiveBox Documentation – ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline. You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows (WSL/Docker). You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list. It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list. The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down. NB: At first glance, this seemed like the best option as it’s super easy to use. However, it only crawls one layer deep. Not enough for our purposes.


Browsertrix Crawler – https://github.com/webrecorder/browsertrix-crawler

Browsertrix Documentation – Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. NB: As of 23 Oct, I have not yet successfully set this up and run it. I keep running into issues with Docker.


WebLicht is an online application for the automatic annotation of text corpora.

Description from the WebLicht website:

WebLicht is an execution environment for automatic annotation of text corpora. Linguistic tools such as tokenizers, part of speech taggers, and parsers are encapsulated as web services, which can be combined by the user into custom processing chains. The resulting annotations can then be visualized in an appropriate way, such as in a table or tree format.

Link to the tool: https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main_Page

How much space is left on a Linux volume? | No more disk space: How can I find what is taking up the space?

As always in Linux, there’s more than one way to get the job done. However, if you need to do it from CLI, this is my preferred method:

I start by running this as root or with sudo:

du -cha --max-depth=1 / | grep -E "M|G"

The grep limits the output to lines with values in the megabyte or gigabyte range. Because the pattern matches any line containing a capital M or G, a few smaller entries can slip through when the path itself contains one of those letters. If your disks are big enough, you could add |T as well to include terabyte amounts. You may get some errors on /proc, /sys, and/or /dev since they are not real files on disk. However, it should still provide valid output for the rest of the directories in root. After you find the biggest ones, you can run the command inside that directory to narrow your way down to the culprit. So for example, if /var was the biggest, you could do this next:

du -cha --max-depth=1 /var | grep -E "M|G"

Full thread at: https://askubuntu.com/questions/911865/no-more-disk-space-how-can-i-find-what-is-taking-up-the-space
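
For cases where the result should be sorted or processed further, the same one-level drill-down can be sketched in Python. This is only a rough equivalent under stated assumptions: it sums apparent file sizes (st_size) rather than the allocated disk blocks that du reports, so the numbers can differ somewhat, and the function names largest_children, dir_size, and human are made up for this example.

```python
import os


def dir_size(path):
    """Sum the apparent sizes of all files below path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is not readable; skip it
    return total


def largest_children(parent, limit=10):
    """Return (size_in_bytes, path) pairs for the biggest direct children of parent."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            try:
                if entry.is_dir(follow_symlinks=False):
                    size = dir_size(entry.path)
                else:
                    size = entry.stat(follow_symlinks=False).st_size
            except OSError:
                continue  # unreadable entry; skip it
            sizes.append((size, entry.path))
    return sorted(sizes, reverse=True)[:limit]


def human(size):
    """Format a byte count with a B/K/M/G/T suffix, using 1024 as the base."""
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024
```

Calling largest_children("/var") as root and printing human(size) next to each path gives a sorted list of the biggest entries; repeating the call on the largest directory narrows the search down one level at a time, just like the du commands above.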


Wikibase is an open-source graph database that was developed for the Wikidata knowledge base. The database management system consists of a collection of extensions for the MediaWiki software. Wikibase is freely available at no cost under the GPL license. Its distinguishing features include its own data model, versioning, and multilingual support. Several programming interfaces and client programs are available for accessing Wikibase. The data model of a Wikibase instance is mapped to the Resource Description Framework, so the data can also be queried with SPARQL. Besides Wikidata, Wikibase is used mainly in the scientific and cultural sectors.
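
To illustrate the SPARQL access, here is a minimal Python sketch that sends a query to a SPARQL endpoint over HTTP using only the standard library. The assumptions: it targets the public Wikidata query service at https://query.wikidata.org/sparql, the example query (five items that are an instance of "house cat", wdt:P31 wd:Q146) is purely illustrative, and the function names and the User-Agent string are made up for this example. Another Wikibase instance would have its own endpoint URL and its own item and property IDs.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata endpoint; other Wikibase instances use a different URL.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: five items that are an instance of (P31) house cat (Q146).
EXAMPLE_QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5"


def build_request(query, endpoint=WIKIDATA_ENDPOINT):
    """Build a GET request that asks the endpoint for SPARQL results as JSON."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": "wikibase-notes-example/0.1",  # hypothetical client name
    }
    return urllib.request.Request(url, headers=headers)


def run_query(query, endpoint=WIKIDATA_ENDPOINT):
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=30) as resp:
        data = json.load(resp)
    return data["results"]["bindings"]
```

run_query(EXAMPLE_QUERY) needs network access and returns one dictionary per result row, keyed by the variable names used in the SELECT clause.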

The DNB is currently testing Wikibase for the GND, and there is an interesting talk about it:

Splitting Word documents into several parts and saving them

This page has VBA code for cutting Word documents into subsections etc. and saving the parts. Works with Word 2016:


This code works, for example, with the delimiter "///"

Sub SplitNotes(delim As String, strFilename As String)
    ' Split the active document at each delimiter and save every non-empty part as its own file.
    Dim doc As Document
    Dim arrNotes
    Dim I As Long
    Dim X As Long
    Dim Response As Integer
    arrNotes = Split(ActiveDocument.Range, delim)
    Response = MsgBox("This will split the document into " & UBound(arrNotes) + 1 & " sections. Do you wish to proceed?", vbYesNo)
    If Response = vbNo Then Exit Sub
    For I = LBound(arrNotes) To UBound(arrNotes)
        ' Skip empty parts, e.g. after a trailing delimiter
        If Trim(arrNotes(I)) <> "" Then
            X = X + 1
            Set doc = Documents.Add
            doc.Range = arrNotes(I)
            ' Save next to the document that holds this macro, numbered 001, 002, ...
            doc.SaveAs ThisDocument.Path & "\" & strFilename & Format(X, "000")
            doc.Close True
        End If
    Next I
End Sub

Sub test()
    ' delimiter & filename
    SplitNotes "///", "Notes "
End Sub
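
For plain-text notes outside of Word, the same idea can be sketched in Python: split at the delimiter, skip empty parts, and number the output files with three digits. This is not a replacement for the macro, since it works on a text string rather than on a Word document; the function name split_notes and the .txt extension are assumptions made for this example.

```python
from pathlib import Path


def split_notes(text, delim, out_dir, basename):
    """Write each non-empty part of text to its own numbered file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    counter = 0
    for part in text.split(delim):
        if not part.strip():
            continue  # skip empty parts, e.g. between two adjacent delimiters
        counter += 1
        target = out_dir / f"{basename}{counter:03d}.txt"
        target.write_text(part, encoding="utf-8")
        written.append(target)
    return written
```

For example, split_notes(text, "///", "parts", "Notes ") writes "Notes 001.txt", "Notes 002.txt", and so on into the folder "parts".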

WP Offload Media Lite for Amazon S3, DigitalOcean Spaces, and Google Cloud Storage

This plugin automatically copies images, videos, documents, and any other media added through WordPress’ media uploader to Amazon S3, DigitalOcean Spaces or Google Cloud Storage. It then automatically replaces the URL to each media file with its respective Amazon S3, DigitalOcean Spaces or Google Cloud Storage URL or, if you have configured Amazon CloudFront or another CDN with or without a custom domain, that URL instead. Image thumbnails are also copied to the bucket and delivered through the correct remote URL.

Uploading files directly to your Amazon S3, DigitalOcean Spaces or Google Cloud Storage account is not currently supported by this plugin. They are uploaded to your server first, then copied to the bucket. There is, however, an option to automatically remove the files from your server once they are copied to the bucket.
