sitefetch by egoist

CLI tool for site scraping into a text file for AI models

created 6 months ago
1,587 stars

Top 26.9% on sourcepulse

1 Expert Loves This Project

Project Summary

This project provides a command-line tool and a Node.js API for fetching an entire website and saving its content into a single text file, optimized for use with AI models. It targets developers and researchers needing to ingest large amounts of web content for analysis or training.

How It Works

The tool recursively crawls a given URL, extracting readable content from each page using Mozilla's Readability.js. Users can specify CSS selectors to refine content extraction on pages where Readability.js might not yield optimal results. It supports concurrent fetching for improved performance and uses micromatch for flexible URL filtering.
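The crawl loop described above can be sketched roughly as follows. This is a minimal illustration of a recursive crawl with bounded concurrency, not sitefetch's actual code: the names `crawl`, `Page`, `FetchPage`, and `shouldVisit` are hypothetical, the page fetcher is injected so the sketch runs without network access, and content extraction (Readability.js in the real tool) is assumed to happen inside that fetcher.

```typescript
// Hypothetical types for illustration; none of these names come from sitefetch.
type Page = { text: string; links: string[] };
type FetchPage = (url: string) => Promise<Page>;

// Crawl outward from startUrl, keeping at most `concurrency` fetches in flight.
// `shouldVisit` stands in for URL filtering (micromatch patterns in the real tool).
function crawl(
  startUrl: string,
  fetchPage: FetchPage,
  concurrency: number,
  shouldVisit: (url: string) => boolean = () => true,
): Promise<Map<string, string>> {
  const limit = Math.max(1, concurrency);
  const results = new Map<string, string>();
  const seen = new Set<string>([startUrl]); // prevents fetching a URL twice
  const queue: string[] = [startUrl];
  let active = 0;

  return new Promise((resolve) => {
    const worker = async (url: string): Promise<void> => {
      try {
        const page = await fetchPage(url);
        results.set(url, page.text);
        for (const link of page.links) {
          if (!seen.has(link) && shouldVisit(link)) {
            seen.add(link);
            queue.push(link);
          }
        }
      } catch (_err) {
        // In this sketch a page that fails to load is simply skipped.
      }
      active--;
      pump();
    };

    const pump = (): void => {
      // Done when nothing is queued and nothing is in flight.
      if (queue.length === 0 && active === 0) {
        resolve(results);
        return;
      }
      // Start new fetches until the concurrency limit is reached.
      while (active < limit && queue.length > 0) {
        const url = queue.shift()!;
        active++;
        void worker(url);
      }
    };

    pump();
  });
}
```

Calling `crawl(startUrl, fetchPage, 3)` with any function that returns a page's text and outgoing links resolves to a map from URL to extracted text, which could then be joined into a single output file. Injecting the fetcher keeps the queueing logic separate from HTTP and parsing concerns.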

Quick Start & Requirements

  • Install globally: npm i -g sitefetch or bun i -g sitefetch
  • One-off usage: bunx sitefetch or npx sitefetch
  • Usage: sitefetch https://example.com -o output.txt
  • Dependencies: Node.js, implied by the npm/npx/Bun commands above; no minimum version is specified.
  • Documentation: API options in types.ts

Highlighted Details

  • Fetches entire sites for AI model consumption.
  • Uses Mozilla Readability.js for content extraction.
  • Supports CSS selectors for content refinement.
  • Offers concurrent fetching for performance.
  • Includes a Node.js API for programmatic use.

Maintenance & Community

  • Developed by egoist.
  • No explicit community links (Discord/Slack) or roadmap mentioned in the README.

Licensing & Compatibility

  • License: MIT.
  • Compatible with commercial and closed-source projects due to its permissive MIT license.

Limitations & Caveats

The README does not specify versioning or a clear maintenance cadence. Advanced usage might require understanding CSS selectors and micromatch patterns.
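To give a feel for how glob-style URL filtering works, here is a simplified matcher. It is a hand-written stand-in, not micromatch itself: it only understands `*` (any characters within one path segment) and `**` (any characters across segments), and real micromatch supports far more syntax and may differ in edge cases. The function names and the example patterns are hypothetical.

```typescript
// Simplified glob-to-RegExp conversion; an illustration only, not micromatch.
function globToRegExp(glob: string): RegExp {
  let source = "";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (ch === "*") {
      if (glob.charAt(i + 1) === "*") {
        source += ".*"; // "**" may span several path segments
        i++;
      } else {
        source += "[^/]*"; // "*" stays within a single segment
      }
    } else {
      // Escape characters that have special meaning in a regular expression.
      source += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp("^" + source + "$");
}

// True if the pathname matches at least one of the glob patterns.
function matchesAny(pathname: string, globs: string[]): boolean {
  return globs.some((glob) => globToRegExp(glob).test(pathname));
}
```

Under these simplified rules, a pattern such as `/blog/**` selects every path nested under `/blog/`, while `/guide/*` selects only paths one level below `/guide/`.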

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0

Star History

206 stars in the last 90 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 7 more.

firecrawl by mendableai

API service for turning websites into LLM-ready data

created 1 year ago
updated 1 day ago
44k stars

Top 1.9% on sourcepulse