sitefetch by egoist

CLI tool that scrapes an entire site into a single text file for AI models

Created 1 year ago

1,638 stars

Top 25.5% on SourcePulse

1 Expert Loves This Project

Travis Fischer

Founder of Agentic

Project Summary

This project provides a command-line tool and a Node.js API for fetching an entire website and saving its content into a single text file, optimized for use with AI models. It targets developers and researchers needing to ingest large amounts of web content for analysis or training.

How It Works

The tool recursively crawls a given URL, extracting readable content from each page using Mozilla's Readability.js. Users can specify CSS selectors to refine content extraction on pages where Readability.js might not yield optimal results. It supports concurrent fetching for improved performance and uses micromatch for flexible URL filtering.
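
The crawl described above can be illustrated with a minimal sketch. This is not sitefetch's actual code: `crawl`, `CrawledPage`, `FetchPage`, and `CrawlOptions` are hypothetical names, the injected `fetchPage` function stands in for the download and Readability.js extraction step, and the `match` predicate stands in for a micromatch filter. It only shows the general pattern of a same-origin crawl with a cap on concurrent requests.

```typescript
// Hypothetical sketch of a bounded-concurrency crawl; none of these names
// are taken from sitefetch itself.
type CrawledPage = { text: string; links: string[] };
type FetchPage = (url: string) => Promise<CrawledPage>;

interface CrawlOptions {
  concurrency: number;                   // maximum pages in flight at once
  match?: (pathname: string) => boolean; // stand-in for a micromatch URL filter
}

function crawl(
  startUrl: string,
  fetchPage: FetchPage,
  options: CrawlOptions,
): Promise<Map<string, string>> {
  const start = new URL(startUrl);
  const seen = new Set<string>([start.href]); // URLs already queued or fetched
  const queue: string[] = [start.href];
  const pages = new Map<string, string>();    // url -> extracted text
  let active = 0;

  return new Promise((resolve, reject) => {
    const pump = (): void => {
      // Finished once nothing is queued and nothing is in flight.
      if (queue.length === 0 && active === 0) {
        resolve(pages);
        return;
      }
      // Start new fetches until the concurrency cap is reached.
      while (active < options.concurrency && queue.length > 0) {
        const url = queue.shift()!;
        active++;
        fetchPage(url)
          .then((page) => {
            pages.set(url, page.text);
            for (const href of page.links) {
              const next = new URL(href, url); // resolve relative links
              next.hash = "";                  // ignore fragment-only differences
              if (next.origin !== start.origin || seen.has(next.href)) continue;
              if (options.match && !options.match(next.pathname)) continue;
              seen.add(next.href);
              queue.push(next.href);
            }
          })
          .then(() => {
            active--;
            pump();
          })
          .catch(reject);
      }
    };
    pump();
  });
}
```

Injecting `fetchPage` keeps the sketch independent of any particular HTTP client or extraction library, so the traversal logic can be exercised against an in-memory site.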

Quick Start & Requirements

Install globally: npm i -g sitefetch or bun i -g sitefetch
One-off usage: bunx sitefetch or npx sitefetch
Usage: sitefetch https://example.com -o output.txt
Dependencies: Node.js or Bun, as implied by the npm/npx and bun/bunx commands above; no minimum version is specified.
Documentation: API options in types.ts

Highlighted Details

Fetches entire sites for AI model consumption.
Uses Mozilla Readability.js for content extraction.
Supports CSS selectors for content refinement.
Offers concurrent fetching for performance.
Includes a Node.js API for programmatic use.

Maintenance & Community

Developed by egoist.
No explicit community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

License: MIT.
Compatible with commercial and closed-source projects due to its permissive MIT license.

Limitations & Caveats

The README does not specify versioning or a clear maintenance cadence. Advanced usage requires familiarity with CSS selectors (for content extraction) and micromatch patterns (for URL filtering).

Health Check

Last Commit

11 months ago

Responsiveness

Inactive

Star History

6 stars in the last 30 days