ArchiveBox by ArchiveBox

Self-hosted web archiver for preserving web content

Created 8 years ago

26,349 stars

Top 1.4% on SourcePulse


7 Experts Love This Project

Gregor Zunic, Cofounder of Browser Use
Travis Fischer, Founder of Agentic
Jonathan Ragan-Kelley, Professor at MIT
Andrey Vasnetsov, Cofounder of Qdrant
and 3 more

Project Summary

ArchiveBox is an open-source, self-hosted web archiving solution designed for individuals and organizations to preserve digital content. It captures web pages, social media posts, videos, and code repositories in durable, accessible formats, ensuring long-term data availability and user control.

How It Works

ArchiveBox uses a suite of established tools such as Chrome, wget, and yt-dlp to capture content. It saves each snapshot in multiple redundant formats, including HTML, PNG, PDF, WARC, and plain text. For data longevity, everything is stored in ordinary files and folders, so archives stay readable without ArchiveBox itself. The architecture is modular: users can enable or disable individual extractors to suit their needs.
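Because snapshots live in ordinary files and folders, they can be inspected with generic tooling. The following is a minimal sketch of that idea, not ArchiveBox's own code: it assumes a hypothetical layout with one sub-folder per snapshot under an archive directory, and simply counts the file types found in each. The function name summarize_snapshot_files and the ./archive path are illustrative assumptions.

```python
from collections import Counter
from pathlib import Path


def summarize_snapshot_files(archive_dir: Path) -> dict[str, Counter]:
    """Map each snapshot folder name to a count of the file extensions inside it."""
    summary: dict[str, Counter] = {}
    # Assumption: every immediate sub-folder of archive_dir is one snapshot.
    for snapshot in sorted(p for p in archive_dir.iterdir() if p.is_dir()):
        summary[snapshot.name] = Counter(
            f.suffix.lower() or "(none)"
            for f in snapshot.rglob("*")
            if f.is_file()
        )
    return summary


if __name__ == "__main__":
    # "./archive" is a hypothetical location; point this at your own data folder.
    root = Path("./archive")
    if root.is_dir():
        for name, counts in summarize_snapshot_files(root).items():
            print(name, dict(counts))
```

Nothing here depends on ArchiveBox being installed, which is the point of storing archives as plain files.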

Quick Start & Requirements

Installation: Docker Compose is the recommended route (docker compose run archivebox init --setup). Alternatives are plain Docker, pip install archivebox, or an auto-setup script.
Prerequisites: Docker for the recommended route; Python >= 3.10 and Node >= 18 for the pip route. Chromium is required for certain extractors.
Setup: Initialize a new collection by running the init --setup command shown above.
Documentation: Quickstart, Demo, Documentation
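For the pip route, the stated minimums (Python >= 3.10, Node >= 18) can be checked before installing. This is a hedged sketch rather than anything shipped with ArchiveBox; the helper names parse_version, meets_minimum, and check_pip_prerequisites are made up for illustration.

```python
import re
import shutil
import subprocess
import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted version number, e.g. 'v18.17.0' -> (18, 17, 0)."""
    match = re.search(r"(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found: tuple[int, ...], minimum: tuple[int, ...]) -> bool:
    """Compare version tuples lexicographically."""
    return tuple(found) >= tuple(minimum)


def check_pip_prerequisites() -> dict[str, bool]:
    """Report whether the local Python, Node, and Docker look usable."""
    results = {"python>=3.10": meets_minimum(sys.version_info[:3], (3, 10))}
    node = shutil.which("node")
    results["node>=18"] = False
    if node is not None:
        out = subprocess.run([node, "--version"], capture_output=True, text=True).stdout
        try:
            results["node>=18"] = meets_minimum(parse_version(out), (18,))
        except ValueError:
            pass  # Unparseable output is treated as "not satisfied".
    # Docker is only needed for the Docker-based routes.
    results["docker on PATH"] = shutil.which("docker") is not None
    return results


if __name__ == "__main__":
    for requirement, ok in check_pip_prerequisites().items():
        print(f"{requirement}: {'ok' if ok else 'missing'}")
```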

Highlighted Details

Supports importing from various sources: URLs, browser history/bookmarks, Pocket, Pinboard, RSS feeds, and a browser extension.
Extracts rich content: HTML, screenshots, PDFs, articles, audio/video (via yt-dlp), and Git repositories.
Offers multiple interfaces: CLI, self-hosted Web UI, Python API (BETA), REST API (ALPHA), and a filesystem/SQL interface.
Can optionally save archives to archive.org for redundancy.

Maintenance & Community

Active development with a roadmap and changelog available on the wiki.
Community support via Zulip chat.
The project is a 501(c)(3) nonprofit.

Licensing & Compatibility

License: MIT.
Compatibility: Designed for self-hosting; the MIT license permits commercial use and integration into closed-source projects.

Limitations & Caveats

Archiving private content requires careful configuration to avoid data leakage, especially regarding cookies and session tokens.
Archived JavaScript can pose security risks when viewed, as cross-site protections are limited within the Web UI. Users can disable specific extractors like wget and DOM to mitigate this.
Some sites actively block archiving attempts; workarounds involve configuring user agents, cookies, or using alternative front-ends.
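As an illustration of disabling the wget and DOM extractors, the sketch below only builds the command line and does not run it. It assumes the archivebox config --set subcommand and a SAVE_<NAME>=False naming pattern for extractor switches; both should be verified against the configuration documentation for your installed version. The helper build_disable_command is hypothetical.

```python
import shlex


def build_disable_command(extractors: list[str]) -> list[str]:
    """Return an argv list that would switch the named extractors off.

    Assumption: extractor switches are named SAVE_<NAME>; check your version's docs.
    """
    if not extractors:
        raise ValueError("at least one extractor name is required")
    settings = [f"SAVE_{name.upper()}=False" for name in extractors]
    return ["archivebox", "config", "--set", *settings]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(build_disable_command(["wget", "dom"])))
```

Printing rather than executing keeps the change reviewable, which matters when the setting affects how archived JavaScript is handled.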

Health Check

Last Commit: 5 days ago
Responsiveness: 1 day

Star History

473 stars in the last 30 days