ArchiveBox by ArchiveBox

Self-hosted web archiver for preserving web content

Created 8 years ago
24,972 stars

Top 1.6% on SourcePulse

Project Summary

ArchiveBox is an open-source, self-hosted web archiving solution designed for individuals and organizations to preserve digital content. It captures web pages, social media posts, videos, and code repositories in durable, accessible formats, ensuring long-term data availability and user control.

How It Works

ArchiveBox uses standard tools such as Chrome, wget, and yt-dlp to capture content, and saves each snapshot in multiple redundant formats, including HTML, PNG, PDF, WARC, and plain text. For longevity, it stores data as ordinary files and folders, so archives remain readable without ArchiveBox itself. Its modular architecture lets users enable or disable specific extractors as needed.
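
Because snapshots are ordinary files and folders, they can be inspected with nothing but the standard library. The sketch below is a minimal illustration, assuming a layout of one folder per snapshot under an `archive/` directory inside the data directory; that layout and the helper name `summarize_snapshots` are assumptions for illustration, not something this summary specifies.

```python
from pathlib import Path


def summarize_snapshots(data_dir):
    """Map each snapshot folder name to the sorted names of its top-level entries.

    Assumed layout (hypothetical here): <data_dir>/archive/<snapshot>/<output files>
    """
    archive_root = Path(data_dir) / "archive"
    summary = {}
    if not archive_root.is_dir():
        return summary  # nothing archived yet, or a different layout
    for snapshot_dir in sorted(archive_root.iterdir()):
        if snapshot_dir.is_dir():
            summary[snapshot_dir.name] = sorted(p.name for p in snapshot_dir.iterdir())
    return summary


if __name__ == "__main__":
    # Print each snapshot folder and the files found in it.
    for name, entries in summarize_snapshots(".").items():
        print(name, entries)
```

Nothing here imports ArchiveBox, which is the point: the saved outputs stay usable by any tool that can read a directory.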

Quick Start & Requirements

  • Installation: Recommended via Docker Compose (docker compose run archivebox init --setup). Alternatives include plain Docker, pip install archivebox, or an auto-setup script.
  • Prerequisites: Docker (recommended); for a pip install, Python >= 3.10 and Node >= 18. Chromium is required for certain extractors.
  • Setup: Initialize a new collection by running init --setup (shown above in its Docker Compose form).
  • Documentation: Quickstart, Demo, Documentation
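
The Docker Compose quick start can be scripted. The following is a minimal sketch that wraps the `docker compose run archivebox ...` form shown above. The helper names `compose_command` and `run` are hypothetical, the URL is a placeholder, and it assumes a compose file defining an `archivebox` service is already present in the working directory. It defaults to a dry run that only prints the command.

```python
import shlex
import subprocess

# Prefix taken from the quick-start command: docker compose run archivebox ...
COMPOSE_PREFIX = ["docker", "compose", "run", "archivebox"]


def compose_command(*archivebox_args):
    """Build the argv list for one ArchiveBox subcommand run via Docker Compose."""
    return COMPOSE_PREFIX + list(archivebox_args)


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    printable = shlex.join(argv)
    if dry_run:
        print(printable)
        return printable
    subprocess.run(argv, check=True)  # raises CalledProcessError on failure
    return printable


if __name__ == "__main__":
    run(compose_command("init", "--setup"))             # one-time setup
    run(compose_command("add", "https://example.com"))  # placeholder URL
```

Passing arguments as a list rather than a shell string avoids quoting problems when URLs contain characters such as `&` or `?`.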

Highlighted Details

  • Supports importing from various sources: URLs, browser history/bookmarks, Pocket, Pinboard, RSS feeds, and a browser extension.
  • Extracts rich content: HTML, screenshots, PDFs, articles, audio/video (via yt-dlp), and Git repositories.
  • Offers multiple interfaces: CLI, self-hosted Web UI, Python API (BETA), REST API (ALPHA), and a filesystem/SQL interface.
  • Can optionally save archives to archive.org for redundancy.
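
As an illustration of the filesystem/SQL interface listed above, the index can in principle be queried with any SQLite client. The sketch below makes several assumptions that should be checked against a real collection: that the index is a SQLite file, that snapshots live in a table named `core_snapshot`, and that it has `url`, `timestamp`, and `title` columns. The function name `list_snapshots` is hypothetical.

```python
import sqlite3
from contextlib import closing


def list_snapshots(db_path, table="core_snapshot", limit=10):
    """Return up to `limit` (url, timestamp, title) rows, newest timestamp first.

    The table and column names are assumptions about the index schema.
    """
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    query = f'SELECT url, timestamp, title FROM "{table}" ORDER BY timestamp DESC LIMIT ?'
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(query, (limit,)).fetchall()
```

Reading the index this way is read-only in practice, but opening a copy of the database is the safer choice while ArchiveBox itself is running.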

Maintenance & Community

  • Active development with a roadmap and changelog available on the wiki.
  • Community support via Zulip chat.
  • The project is a 501(c)(3) nonprofit.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Designed for self-hosting; the MIT license permits commercial use and integration into closed-source projects, provided the copyright and license notice are retained.

Limitations & Caveats

  • Archiving private content requires careful configuration to avoid data leakage, especially regarding cookies and session tokens.
  • Archived JavaScript can pose security risks when viewed, as cross-site protections are limited within the Web UI. Users can disable specific extractors like wget and DOM to mitigate this.
  • Some sites actively block archiving attempts; workarounds involve configuring user agents, cookies, or using alternative front-ends.
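
The extractor mitigation mentioned above can be expressed as configuration commands. The sketch below builds `archivebox config --set KEY=VALUE` invocations for the `SAVE_WGET` and `SAVE_DOM` options. The helper name `config_set_commands` is hypothetical, and the option names should be verified against the configuration reference for the installed version before use. The commands are only constructed here, not executed.

```python
def config_set_commands(**options):
    """Build one `archivebox config --set KEY=VALUE` argv list per option."""
    return [
        ["archivebox", "config", "--set", f"{key}={value}"]
        for key, value in options.items()
    ]


if __name__ == "__main__":
    # Disable the wget and DOM extractors so archived JavaScript is not
    # saved by those two methods (option names assumed, see lead-in).
    for argv in config_set_commands(SAVE_WGET=False, SAVE_DOM=False):
        print(" ".join(argv))
```

Emitting one command per option keeps each change independent, so a rejected option name does not block the others.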

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull requests (30 days): 0
  • Issues (30 days): 3
  • Star history: 222 stars in the last 30 days
