Web Archiving

Below are the various web crawlers tested in an attempt to adequately crawl NADW sites for web archiving purposes. Notes on each tool's suitability follow its official description.

Heritrix – https://heritrix.readthedocs.io/en/latest/getting-started.html#installation

Heritrix Documentation – Heritrix is the Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project. Heritrix seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations. https://heritrix.readthedocs.io/en/latest/
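Heritrix is a Java application controlled through a web console. A minimal launch sketch based on the installation guide; the install path and version are placeholders, and admin:admin is just an example login:

```shell
# Unpack the Heritrix distribution and point HERITRIX_HOME at it
# (path/version below are placeholders for whatever you downloaded)
export HERITRIX_HOME=/opt/heritrix-3.4.0
# Start Heritrix with an admin login (-a user:password); requires Java
$HERITRIX_HOME/bin/heritrix -a admin:admin
# The web-based control console is then reachable at https://localhost:8443
```

Crawl jobs themselves are configured and launched from the web console, not the command line.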


Archiveweb.page – https://archiveweb.page/

Archiveweb.page Documentation – ArchiveWeb.page is the latest tool from Webrecorder to turn your browser into a full-featured interactive web archiving system! ArchiveWeb.page is available as an extension for any Chrome or Chromium-based browser. (A standalone app version is also in development.) NB: Functions as a Chrome extension. Once you launch the app, you have to manually click through each link, and it records your progress as a WARC file.


WARCreate – https://warcreate.com/

WARCreate Documentation – WARCreate is a Google Chrome extension that allows a user to create a Web ARChive (WARC) file from any browsable webpage. The resulting files can then be used with other tools like the Internet Archive’s open-source Wayback Machine. The tool is an evolving product with the end result pushing toward being a personal web archiving solution for those who wish to securely archive their metadata in a standardized way. NB: Functions as a Chrome extension. Once launched, it captures whatever webpage you are currently on as a WARC file.



grab-site Documentation – The archivist’s web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns. grab-site is an easy preconfigured web crawler designed for backing up websites. Give grab-site a URL and it will recursively crawl the site and write WARC files. NB: only works with specific versions of Python: 3.7 or 3.8.
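Usage is essentially one command per site. A sketch assuming grab-site was installed into a Python 3.7/3.8 virtual environment per its README; the URL and ignore set below are examples:

```shell
# Start the dashboard, which shows all running crawls (http://127.0.0.1:29000)
gs-server &
# Recursively crawl a site and write WARC files into the current directory;
# --igsets applies a named set of ignore patterns, and --no-offsite-links
# keeps the crawl from following links to other hosts
grab-site 'https://example.org/' --igsets=blogs --no-offsite-links
```

Ignore patterns can also be edited dynamically from the dashboard while a crawl is running.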



ArchiveBox Documentation – ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view sites you want to preserve offline. You can set it up as a command-line tool, web app, and desktop app (alpha), on Linux, macOS, and Windows (WSL/Docker). You can feed it URLs one at a time, or schedule regular imports from browser bookmarks or history, feeds like RSS, bookmark services like Pocket/Pinboard, and more. See input formats for a full list. It saves snapshots of the URLs you feed it in several formats: HTML, PDF, PNG screenshots, WARC, and more out-of-the-box, with a wide variety of content extracted and preserved automatically (article text, audio/video, git repos, etc.). See output formats for a full list. The goal is to sleep soundly knowing the part of the internet you care about will be automatically preserved in durable, easily accessible formats for decades after it goes down. NB: At first glance, this seemed like the best option as it’s super easy to use. However, it only crawls one layer deep. Not enough for our purposes.
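The "one layer deep" limit corresponds to ArchiveBox's --depth option, which accepts only 0 (the URL itself) or 1 (the URL plus the pages it links to). A sketch of typical command-line use; the directory and URL are examples:

```shell
# Create an archive collection in an empty directory
mkdir ~/archivebox-data && cd ~/archivebox-data
archivebox init
# Depth 0: snapshot just this URL
archivebox add 'https://example.org/'
# Depth 1: also snapshot every page it links to (the maximum depth)
archivebox add --depth=1 'https://example.org/'
# Browse the snapshots in a web UI
archivebox server 0.0.0.0:8000
```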


Browsertrix Crawler – https://github.com/webrecorder/browsertrix-crawler

Browsertrix Documentation – Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. NB: As of 23 Oct, I have not yet successfully set this up and run it. I keep running into issues with Docker.
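For reference, the invocation from the project README looks roughly like the sketch below (the collection name and URL are examples); once Docker is working, the whole crawl runs inside a single container:

```shell
# Pull the crawler image
docker pull webrecorder/browsertrix-crawler
# Crawl a site and package the result as a WACZ archive under ./crawls/
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.org/ --generateWACZ --collection test-crawl
```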


WebLicht is an online application for the automatic annotation of text corpora.

Description from the WebLicht website:

WebLicht is an execution environment for automatic annotation of text corpora. Linguistic tools such as tokenizers, part of speech taggers, and parsers are encapsulated as web services, which can be combined by the user into custom processing chains. The resulting annotations can then be visualized in an appropriate way, such as in a table or tree format.

Link to the tool: https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/Main_Page