CLI tool for site scraping into a text file for AI models
This project provides a command-line tool and a Node.js API for fetching an entire website and saving its content into a single text file, optimized for use with AI models. It targets developers and researchers needing to ingest large amounts of web content for analysis or training.
How It Works
The tool recursively crawls a given URL, extracting readable content from each page using Mozilla's Readability.js. Users can specify CSS selectors to refine content extraction on pages where Readability.js might not yield optimal results. It supports concurrent fetching for improved performance and uses micromatch for flexible URL filtering.
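The crawl described above can be pictured as a queue of discovered URLs drained by a bounded number of concurrent fetches. The following is a minimal sketch of that general technique, not sitefetch's actual implementation: `crawl`, `getLinks`, and the in-memory `demoSite` are hypothetical names introduced for illustration, and a real crawler would fetch each page over HTTP and extract its links and readable content instead of reading from a map.

```typescript
// Hypothetical sketch of a recursive crawl with bounded concurrency.
// `getLinks` stands in for "fetch the page and return the links found on it".
async function crawl(
  start: string,
  getLinks: (path: string) => Promise<string[]>,
  concurrency: number,
): Promise<string[]> {
  const seen = new Set<string>([start]); // every path ever queued
  const queue: string[] = [start];       // paths waiting to be fetched
  const visited: string[] = [];          // paths fetched successfully
  let active = 0;                        // fetches currently in flight

  return new Promise<string[]>((resolve, reject) => {
    const pump = (): void => {
      // Done once nothing is queued and nothing is in flight.
      if (queue.length === 0 && active === 0) {
        resolve(visited);
        return;
      }
      // Start new fetches until the concurrency limit is reached.
      while (active < concurrency && queue.length > 0) {
        const path = queue.shift()!;
        active++;
        getLinks(path)
          .then((links) => {
            visited.push(path);
            for (const link of links) {
              if (!seen.has(link)) {
                seen.add(link); // deduplicate before queueing
                queue.push(link);
              }
            }
          })
          .catch(reject)
          .finally(() => {
            active--;
            pump();
          });
      }
    };
    pump();
  });
}

// Usage with a hypothetical in-memory site (path -> outgoing links).
const demoSite: Record<string, string[]> = {
  "/": ["/a", "/b"],
  "/a": ["/", "/c"],
  "/b": ["/c"],
  "/c": [],
};
crawl("/", async (path) => demoSite[path] ?? [], 2).then((pages) => {
  console.log(`crawled ${pages.length} pages`);
});
```

The `seen` set is what keeps the recursion from looping on sites whose pages link back to each other, and the `active` counter is the only thing limiting how many fetches run at once.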
Quick Start & Requirements
Install globally: npm i -g sitefetch (or bun i -g sitefetch)
Run without installing: bunx sitefetch (or npx sitefetch)
Basic usage: sitefetch https://example.com -o output.txt
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify versioning or a clear maintenance cadence. Advanced usage might require understanding CSS selectors and micromatch patterns.
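To give a feel for what glob-style URL patterns express, here is a deliberately simplified matcher. It is a stand-in written for illustration and is not micromatch: it handles only `*` (within one path segment) and `**` (across segments), and micromatch's real semantics cover more syntax and differ in edge cases. The names `globToRegExp` and `matchesAny` are hypothetical.

```typescript
// Simplified glob matching for URL paths. NOT micromatch; an illustrative sketch only.
function globToRegExp(glob: string): RegExp {
  let source = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        source += ".*"; // "**" may span "/" boundaries
        i++;            // consume the second "*"
      } else {
        source += "[^/]*"; // "*" stays inside a single path segment
      }
    } else {
      // Escape characters that have special meaning in a regular expression.
      source += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${source}$`);
}

// True when the path matches at least one of the given patterns.
function matchesAny(path: string, globs: string[]): boolean {
  return globs.some((glob) => globToRegExp(glob).test(path));
}

console.log(matchesAny("/blog/2024/post", ["/blog/**"])); // true
console.log(matchesAny("/guide/a/b", ["/guide/*"]));      // false
```

In this sketch a pattern such as `/blog/**` selects every page below `/blog/`, while `/guide/*` selects only pages exactly one level below `/guide/`.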