
grab-site (web crawler)

 

grab-site is the archivist’s web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

GitHub repo: https://github.com/ArchiveTeam/grab-site

Installation steps:

Install on Ubuntu 18.04, 20.04, 22.04, Debian 10 (buster), Debian 11 (bullseye)

  1. On Debian, use su to become root if sudo is not configured to give you access.

sudo apt-get update
sudo apt-get install --no-install-recommends \
    wget ca-certificates git build-essential libssl-dev zlib1g-dev \
    libbz2-dev libreadline-dev libsqlite3-dev libffi-dev libxml2-dev \
    libxslt1-dev libre2-dev pkg-config

If you see "Unable to locate package", run the two commands again.

  2. As a non-root user:

wget https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer
chmod +x pyenv-installer
./pyenv-installer
~/.pyenv/bin/pyenv install 3.8.15
~/.pyenv/versions/3.8.15/bin/python -m venv ~/gs-venv
~/gs-venv/bin/pip install --no-binary lxml --upgrade git+https://github.com/ArchiveTeam/grab-site

--no-binary lxml is necessary for the html5-parser build.

  3. Add this to your ~/.bashrc or ~/.zshrc:

PATH="$PATH:$HOME/gs-venv/bin"

and then restart your shell (e.g. by opening a new terminal tab/window).
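The PATH step can be made repeatable with a small shell sketch. The function names add_gs_path and check_gs_commands are hypothetical helpers, not part of grab-site. The sketch assumes the venv lives at ~/gs-venv as in the steps above. It appends the PATH line only when that exact line is missing, then checks that the two commands used on this page exist in the venv.

```shell
# Hypothetical helper (not part of grab-site): append the PATH line to an
# rc file only if that exact line is not already present.
add_gs_path() {
    rc_file="$1"
    line='PATH="$PATH:$HOME/gs-venv/bin"'
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}

# Hypothetical helper: report whether grab-site and gs-server exist as
# executables in the given venv (default: ~/gs-venv).
check_gs_commands() {
    venv="${1:-$HOME/gs-venv}"
    status=0
    for cmd in grab-site gs-server; do
        if [ -x "$venv/bin/$cmd" ]; then
            echo "found: $cmd"
        else
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}

# Example usage:
#   add_gs_path ~/.bashrc
#   check_gs_commands
```

Running add_gs_path more than once is harmless, since the line is only written when it is absent. A new shell is still needed before the changed PATH takes effect.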

Upgrade an existing install

To update grab-site, simply run the ~/gs-venv/bin/pip install … or nix-env … command used to install it originally (see above).

After upgrading, stop gs-server with kill or ctrl-c, then start it again. Existing grab-site crawls will automatically reconnect to the new server.

Using grab-site

First, start the dashboard with:

gs-server

and point your browser to http://127.0.0.1:29000/

Note: gs-server listens on all interfaces by default, so you can reach the dashboard by a non-localhost IP as well, e.g. a LAN or WAN IP. (Sub-note: no code execution capabilities are exposed on any interface.)
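To confirm from a terminal that the dashboard is reachable, something like the following sketch may help. The function name dashboard_up is hypothetical. It assumes curl is installed and that the dashboard answers a plain HTTP GET on / with a success status; the host defaults to 127.0.0.1 and the port to 29000, as in the URL above.

```shell
# Hypothetical helper: succeed (exit status 0) if the dashboard responds.
#   $1 = host or IP (default 127.0.0.1)
#   $2 = port (default 29000, the port in the URL above)
dashboard_up() {
    url="http://${1:-127.0.0.1}:${2:-29000}/"
    # -f: fail on HTTP errors, -sS: silent but still print errors,
    # -o /dev/null: discard the page body, --max-time 5: give up after 5 s
    curl -fsS -o /dev/null --max-time 5 "$url"
}

# Example usage:
#   dashboard_up && echo "dashboard is reachable"
#   dashboard_up 192.168.1.10      # check via a LAN IP (example address)
```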

Then, start as many crawls as you want with:

grab-site 'URL'

Do this inside tmux unless the crawls are very short.
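One way to do this is to launch each crawl in its own detached tmux session, so it keeps running after the terminal is closed. This is a minimal sketch: the function name start_crawl_in_tmux and the session name are hypothetical, and it assumes tmux is installed and grab-site is on the PATH that tmux sees.

```shell
# Hypothetical helper: start a crawl in a detached tmux session.
#   $1            = tmux session name (any label you choose)
#   remaining args = passed through unchanged to grab-site
start_crawl_in_tmux() {
    session="$1"
    shift
    # -d: do not attach to the new session, -s: name of the session.
    # The remaining words are run as the session's command.
    tmux new-session -d -s "$session" grab-site "$@"
}

# Example usage:
#   start_crawl_in_tmux example-crawl 'https://example.com/'
#   tmux attach -t example-crawl    # watch the crawl; detach with ctrl-b d
```

The session ends when grab-site exits, so a session that disappears immediately usually means the command failed to start.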

grab-site outputs WARCs, logs, and control files to a new subdirectory in the directory from which you launched grab-site, referred to here as "DIR". (Use ls -lrt to find it.)
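As an alternative to reading the ls -lrt listing by eye, the most recently modified subdirectory can be picked out with a short sketch. The function name newest_crawl_dir is hypothetical, and the sketch assumes the newest subdirectory of the launch directory is the crawl you just started.

```shell
# Hypothetical helper: print the most recently modified subdirectory of
# the given directory (default: the current directory).
newest_crawl_dir() {
    # -t: sort by modification time, newest first
    # -d: list the directories themselves, not their contents
    d=$(ls -td -- "${1:-.}"/*/ 2>/dev/null | head -n 1)
    # Strip the trailing slash left over from the */ glob
    printf '%s\n' "${d%/}"
}

# Example usage:
#   DIR=$(newest_crawl_dir)
#   ls -l "$DIR"
```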

You can pass multiple URL arguments to include them in the same crawl, whether they are on the same domain or different domains entirely.

See the "SolrWayback" entry in the Knowledge Base for help with accessing WARC files.