ArchiveBox by ArchiveBox

Self-hosted web archiver for preserving web content

created 8 years ago
24,586 stars

Top 1.7% on sourcepulse

Project Summary

ArchiveBox is an open-source, self-hosted web archiving solution designed for individuals and organizations to preserve digital content. It captures web pages, social media posts, videos, and code repositories in durable, accessible formats, ensuring long-term data availability and user control.

How It Works

ArchiveBox uses a suite of standard tools such as Chrome, wget, and yt-dlp to capture content, and saves each snapshot in multiple redundant formats including HTML, PNG, PDF, WARC, and plain text. It is designed for data longevity: archives are stored as ordinary files and folders, so they remain readable without ArchiveBox itself. Its modular architecture lets users enable or disable specific extractors as needed.
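
Because snapshots are stored as ordinary files, they can be inspected with nothing more than a standard library. The following is a minimal sketch, assuming one snapshot lives in its own folder; the function name `summarize_snapshot` and any folder path passed to it are hypothetical and do not reflect ArchiveBox's documented layout.

```python
from collections import Counter
from pathlib import Path


def summarize_snapshot(folder: str) -> dict[str, int]:
    """Count the files under a snapshot folder by extension, without ArchiveBox."""
    counts: Counter = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Files without a suffix are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return dict(counts)
```

Calling `summarize_snapshot` on a folder returns a mapping such as extension to file count, which is enough to confirm that the redundant formats listed above are actually present on disk.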

Quick Start & Requirements

  • Installation: Recommended via Docker Compose (docker compose run archivebox init --setup). Alternatives include plain Docker, pip install archivebox, or an auto-setup script.
  • Prerequisites: Docker (recommended), or Python >= 3.10 and Node >= 18 for the pip install route. Chromium is required for certain extractors.
  • Setup: Initial setup is done by the init --setup subcommand (for example, archivebox init --setup after a pip install).
  • Documentation: Quickstart, Demo, Documentation
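
The prerequisites above can be checked before installing. This is a rough sketch using only the Python standard library; `check_prerequisites` is a hypothetical helper, it only tests whether `docker` and `node` are on the PATH, and it does not verify the Node version.

```python
import shutil
import sys


def check_prerequisites(min_python: tuple[int, int] = (3, 10)) -> dict[str, bool]:
    """Report which of the listed prerequisites look satisfied on this machine."""
    return {
        "python": sys.version_info[:2] >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "docker": shutil.which("docker") is not None,
        "node": shutil.which("node") is not None,
    }
```

A missing `docker` entry is not fatal if the pip route is used, and a missing `node` entry is not fatal if Docker is used.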

Highlighted Details

  • Supports importing from various sources: URLs, browser history/bookmarks, Pocket, Pinboard, RSS feeds, and a browser extension.
  • Extracts rich content: HTML, screenshots, PDFs, articles, audio/video (via yt-dlp), and Git repositories.
  • Offers multiple interfaces: CLI, self-hosted Web UI, Python API (BETA), REST API (ALPHA), and a filesystem/SQL interface.
  • Can optionally save archives to archive.org for redundancy.
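
All of the import sources above ultimately reduce to a list of URLs. ArchiveBox ships its own importers, so the following is only an illustrative sketch of that common denominator: a hypothetical helper, `collect_urls`, that pulls unique http(s) links out of arbitrary text such as an exported bookmarks file.

```python
import re

# Match http(s) URLs up to whitespace, angle brackets, or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def collect_urls(text: str) -> list[str]:
    """Return unique http(s) URLs found in text, in first-seen order."""
    seen: dict[str, None] = {}
    for match in URL_PATTERN.findall(text):
        # Trim punctuation that commonly trails a URL in prose.
        seen.setdefault(match.rstrip(".,;:)"), None)
    return list(seen)
```

The result can be written one URL per line and handed to whichever import path is in use; check the project documentation for the input formats your version accepts.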

Maintenance & Community

  • Active development with a roadmap and changelog available on the wiki.
  • Community support via Zulip chat.
  • The project is a 501(c)(3) nonprofit.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Designed for self-hosting; commercial use and integration into closed-source projects are permitted under the MIT license.

Limitations & Caveats

  • Archiving private content requires careful configuration to avoid data leakage, especially regarding cookies and session tokens.
  • Archived JavaScript can pose security risks when viewed, as cross-site protections are limited within the Web UI. Users can disable specific extractors like wget and DOM to mitigate this.
  • Some sites actively block archiving attempts; workarounds involve configuring user agents, cookies, or using alternative front-ends.
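
The extractor mitigation mentioned above can be expressed as configuration overrides. This sketch assumes extractors are toggled by options named `SAVE_<NAME>` (for example `SAVE_WGET` and `SAVE_DOM`); that naming is an assumption to verify against the configuration documentation for your ArchiveBox version, and both helper functions are hypothetical.

```python
import os


def extractor_overrides(disabled: list[str]) -> dict[str, str]:
    """Build SAVE_<NAME>=False settings for the extractors to switch off."""
    # Assumption: extractors are toggled by SAVE_<NAME> options; verify
    # the exact names against the configuration docs for your version.
    return {f"SAVE_{name.upper()}": "False" for name in disabled}


def environment_with_overrides(disabled: list[str]) -> dict[str, str]:
    """Copy the current environment and layer the overrides on top."""
    env = dict(os.environ)
    env.update(extractor_overrides(disabled))
    return env
```

The dictionary returned by `environment_with_overrides(["wget", "dom"])` could be passed as the `env` argument of `subprocess.run` when launching the archiver, leaving the parent process's environment untouched.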

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 4
  • Star history: 913 stars in the last 90 days
