Fork me on GitHub! File and triage issues, fix bugs, refactor code, submit pull requests… all welcome! Discussion in #archiveteam-dev (on hackint).
The warrior uses the following repos:
Client code includes code that the Warrior executes.
- warrior3 - bootstrap and tools to build the image
- Bootstrap code that the appliance pulls from GitHub and that starts a Docker container
- archiveteam/warrior-dockerfile - the container
- Instructions to bootstrap the Docker container
- warrior2 - warrior runner code
- Main code that runs inside the Docker container
- Library that helps build grab scripts, the web interface, and pipeline engine for the warrior. The name "seesaw" comes from its original behavior: download, upload, and repeat.
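The "download, upload, and repeat" behavior behind the seesaw name can be sketched roughly as below. This is a minimal illustration only, not the library's actual API: `seesaw_loop`, `download`, and `upload` are hypothetical names invented for this example.

```python
from typing import Callable, Iterable, List


def seesaw_loop(
    items: Iterable[str],
    download: Callable[[str], str],
    upload: Callable[[str, str], str],
) -> List[str]:
    """Hypothetical sketch: for each item, download it, upload the result, repeat."""
    receipts = []
    for item in items:
        data = download(item)         # "see": fetch the item's content
        receipt = upload(item, data)  # "saw": send the content onward
        receipts.append(receipt)
    return receipts


if __name__ == "__main__":
    # Stand-in callables; a real grab would run a crawler and an uploader here.
    done = seesaw_loop(
        ["a", "b"],
        download=lambda item: item.upper(),
        upload=lambda item, data: f"{item}:{data}",
    )
    print(done)
```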
Projects are in separate repositories, typically named with -grab as a suffix.
Item lists that are loaded into the tracker are sometimes saved into a repo with -items as a suffix.
Scripts to build searchable index HTML pages are usually suffixed with
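As a small illustration of the -grab and -items naming conventions, a repository name could be sorted by its suffix as sketched below. The function name `classify_repo`, the category labels, and the example repository names are assumptions made for this sketch; they are not part of any ArchiveTeam tool.

```python
def classify_repo(name: str) -> str:
    """Hypothetical helper: guess a repository's role from its name suffix."""
    if name.endswith("-grab"):
        return "project"    # grab scripts for one project
    if name.endswith("-items"):
        return "item-list"  # item lists loaded into the tracker
    return "other"          # anything that follows neither convention


if __name__ == "__main__":
    # Example names are made up for illustration.
    for repo in ["example-grab", "example-items", "wpull"]:
        print(repo, "->", classify_repo(repo))
```

Because the conventions are described as typical rather than strict, a suffix check like this is only a heuristic.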
Server code includes code that the Tracker executes.
universal-tracker - Ruby
- The server that Seesaw contacts
warrior-hq - Ruby
- The server that the warrior appliances contact for project metadata
archiveteam-megawarc-factory - shell
- The scripts that bundle the WARC files.
URLTeam code is independent from the tracker and warrior.
- The client code that scrapes the shortlinks. It includes a pipeline shim to run the code.
- The server code for the tracker.
- A pipeline shim to run the code.
- The code for both the client library and tracker.
- Dockerfile that runs the warrior inside a Docker container.
ArchiveBot - Ruby, Python, Lua
- An IRC bot for archiving websites.
wget-lua - C, Lua
- A patched version of Wget for web crawling.
standalone-readme-template - Markdown
- A template for readme files included in grab repositories.
archiveteam-dev-env - Shell
- Ubuntu preseed for a developer environment for ArchiveTeam projects.
wpull - Python
- A Wget-compatible web downloader/crawler.