ArchiveBox is an open-source, self-hosted web archiving solution designed for individuals and organizations to preserve digital content. It captures web pages, social media posts, videos, and code repositories in durable, accessible formats, ensuring long-term data availability and user control.
How It Works
ArchiveBox leverages a suite of industry-standard tools like Chrome, wget, and yt-dlp to capture content. It saves snapshots in multiple redundant formats including HTML, PNG, PDF, WARC, and plain text. The system is designed for data longevity, storing information in ordinary files and folders, making it readable without ArchiveBox itself. It offers a modular architecture, allowing users to enable or disable specific extractors based on their needs.
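Because snapshots are stored as ordinary files and folders, they can be inspected without ArchiveBox. The sketch below assumes a layout with one folder per snapshot inside a single archive directory; the function name list_snapshots and the folder layout are illustrative assumptions, not something this summary confirms.

```python
from pathlib import Path


def list_snapshots(archive_root):
    """Map each snapshot folder name to the sorted names of the entries inside it.

    Assumption: every sub-folder of `archive_root` holds one snapshot.
    Plain files sitting directly in `archive_root` are ignored.
    """
    root = Path(archive_root)
    snapshots = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        snapshots[folder.name] = sorted(child.name for child in folder.iterdir())
    return snapshots
```

Calling list_snapshots("archive") on a data directory would return a dictionary such as one folder name mapped to the output files saved for that page, using nothing but the Python standard library.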
Quick Start & Requirements
- Installation: Recommended via Docker Compose (docker compose run archivebox init --setup). Alternatives include plain Docker, pip install archivebox, or an auto-setup script.
- Prerequisites: Docker (recommended), Python >= 3.10, Node >= 18 (for pip install). Chromium is required for certain extractors.
- Setup: Initial setup involves running init --setup.
- Documentation: Quickstart, Demo, Documentation
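Before choosing an install route, it can help to check which of the listed prerequisites are present. This is a minimal sketch using only the standard library; check_prerequisites is a hypothetical helper, and it only tests whether docker and node are on the PATH, not whether Node meets the >= 18 requirement.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Report which prerequisites from the list above appear to be available."""
    return {
        # The running interpreter must be at least Python 3.10 for the pip route.
        "python_ok": sys.version_info >= min_python,
        # Docker is the recommended route; this only checks the executable exists.
        "docker_found": shutil.which("docker") is not None,
        # Node is needed for the pip route; its version is NOT checked here.
        "node_found": shutil.which("node") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {ok}")
```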
Highlighted Details
- Supports importing from various sources: URLs, browser history/bookmarks, Pocket, Pinboard, RSS feeds, and a browser extension.
- Extracts rich content: HTML, screenshots, PDFs, articles, audio/video (via yt-dlp), and Git repositories.
- Offers multiple interfaces: CLI, self-hosted Web UI, Python API (BETA), REST API (ALPHA), and a filesystem/SQL interface.
- Can optionally save archives to archive.org for redundancy.
Maintenance & Community
- Active development with a roadmap and changelog available on the wiki.
- Community support via Zulip chat.
- The project is a 501(c)(3) nonprofit.
Licensing & Compatibility
- License: MIT.
- Compatibility: Designed for self-hosting; commercial use and integration into closed-source projects are permitted under the MIT license.
Limitations & Caveats
- Archiving private content requires careful configuration to avoid data leakage, especially regarding cookies and session tokens.
- Archived JavaScript can pose security risks when viewed, as cross-site protections are limited within the Web UI. Users can disable specific extractors like wget and DOM to mitigate this.
- Some sites actively block archiving attempts; workarounds involve configuring user agents, cookies, or using alternative front-ends.
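Disabling the wget and DOM extractors, as mentioned above, might be scripted along these lines. The sketch only builds the command lines and does not run them. The option names SAVE_WGET and SAVE_DOM and the config --set subcommand are assumptions that may differ between ArchiveBox versions, so check them against the configuration documentation for your release; config_set_commands is a hypothetical helper.

```python
import shlex


def config_set_commands(options, via_docker=False):
    """Build one `config --set KEY=VALUE` command (as an argument list) per option.

    Assumption: the CLI accepts `archivebox config --set KEY=VALUE`.
    With via_docker=True the command is prefixed for Docker Compose.
    """
    prefix = ["docker", "compose", "run", "archivebox"] if via_docker else ["archivebox"]
    return [prefix + ["config", "--set", f"{key}={value}"] for key, value in options.items()]


# Assumed option names for the wget and DOM extractors.
for command in config_set_commands({"SAVE_WGET": False, "SAVE_DOM": False}):
    print(shlex.join(command))
```

Returning argument lists rather than a single string keeps the commands safe to pass to subprocess.run later without shell quoting problems.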