create-llmstxt-py by firecrawl

Python tool for generating LLM-ready website content

Created 7 months ago
261 stars

Top 97.5% on SourcePulse

Summary

This Python script automates the generation of llms.txt and llms-full.txt files for any website, packaging its content in a standardized format that Large Language Models can consume. It targets developers and researchers who need to ingest web content: it maps a site's URLs, scrapes each page, and summarizes the results. The primary benefit is streamlined LLM data preparation.
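To make the index file more concrete, here is a minimal sketch of how per-page titles and descriptions might be assembled into an llms.txt-style document. The exact layout the script emits is not stated in this summary, so the heading line and the one-link-per-page format below are assumptions based on the common llms.txt convention, and `format_llms_txt` is a hypothetical helper name.

```python
from typing import Dict, List


def format_llms_txt(site_url: str, pages: List[Dict[str, str]]) -> str:
    """Build an index: one heading line, then one Markdown link line per page.

    The layout is an assumption, not the script's confirmed output format.
    """
    lines = [f"# {site_url} llms.txt", ""]
    for page in pages:
        lines.append(f"- [{page['title']}]({page['url']}): {page['description']}")
    return "\n".join(lines) + "\n"


# Illustrative data only; real titles and descriptions would come from the model.
example_pages = [
    {
        "url": "https://example.com/",
        "title": "Example Home Page",
        "description": "Landing page introducing the example site and its sections",
    },
]
index_text = format_llms_txt("https://example.com", example_pages)
```

A full-text variant would presumably append each page's markdown body under its entry, but that layout is likewise not specified here.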

How It Works

The script calls Firecrawl's /map endpoint to discover all of the website's URLs, then scrapes each page as markdown. It uses OpenAI's GPT-4o-mini model to generate a concise title (3-4 words) and description (9-10 words) for each page. Pages are processed concurrently in batches of up to 10 URLs, and a configurable URL limit and output options help control run time and resource usage.
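The batching step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: `process_url` is a hypothetical stand-in for the scrape-and-summarize step (it makes no Firecrawl or OpenAI calls), and using `ThreadPoolExecutor` is one plausible way to get the concurrency, not a confirmed detail.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List


def batched(urls: List[str], size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def process_url(url: str) -> Dict[str, str]:
    """Hypothetical placeholder for scraping one page and summarizing it."""
    return {"url": url, "title": "Placeholder title", "description": "Placeholder description"}


def process_all(
    urls: List[str],
    worker: Callable[[str], Dict[str, str]] = process_url,
    batch_size: int = 10,
) -> List[Dict[str, str]]:
    """Run `worker` over every URL, one batch at a time, concurrently within a batch."""
    results: List[Dict[str, str]] = []
    for batch in batched(urls, batch_size):
        # One thread per URL in the batch; map() returns results in input order.
        with ThreadPoolExecutor(max_workers=len(batch)) as pool:
            results.extend(pool.map(worker, batch))
    return results
```

Finishing one batch before starting the next caps the number of in-flight requests at the batch size, which is the property that matters when the upstream APIs enforce rate limits.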

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies using pip install -r requirements.txt.
  • Prerequisites: Python 3.7+, a Firecrawl API key, and an OpenAI API key. API keys can be configured via .env file, environment variables, or command-line arguments.
  • Usage: Run the script with python generate-llmstxt.py <website_url>. Options include limiting URLs (--max-urls), specifying output directories (--output-dir), and generating only llms.txt (--no-full-text).
  • Links: Firecrawl API Key, OpenAI API Key.
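
The command-line options listed above could be declared with `argparse` along these lines. Only the flag names (`--max-urls`, `--output-dir`, `--no-full-text`) come from the README summary; the default values, help strings, and the `build_parser` helper are assumptions for illustration, and the API-key arguments are omitted because their names are not given here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options; defaults here are illustrative guesses."""
    parser = argparse.ArgumentParser(prog="generate-llmstxt.py")
    parser.add_argument("url", help="Website to map, scrape, and summarize")
    parser.add_argument("--max-urls", type=int, default=20,
                        help="Upper bound on URLs to process (default is assumed)")
    parser.add_argument("--output-dir", default=".",
                        help="Directory for the generated files (default is assumed)")
    parser.add_argument("--no-full-text", action="store_true",
                        help="Generate only llms.txt, skipping llms-full.txt")
    return parser


# Parse an explicit argument list rather than sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["https://example.com", "--max-urls", "50", "--no-full-text"]
)
```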

Highlighted Details

  • Automated website URL discovery via Firecrawl's /map endpoint.
  • AI-powered summarization using GPT-4o-mini for page titles and descriptions.
  • Parallel processing and batching (default 10 URLs) enhance generation speed.
  • Configurable processing limits and output formats cater to different needs.

Maintenance & Community

No specific details regarding maintainers, community channels, or roadmap are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting broad use, including commercial applications.

Limitations & Caveats

Processing time scales with website size and network response times. For very large sites, users may encounter rate limiting or memory issues, necessitating the use of --max-urls, --no-full-text, or manual adjustments to batch sizes and delays. Failed URL scrapes are logged and skipped.
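
The log-and-skip behavior and the pause between batches might look something like the sketch below. It assumes a caller-supplied `scrape` function that raises on failure; that function, the `scrape_with_skips` name, and the default delay are all hypothetical, and the batch is processed sequentially here to keep the error handling easy to follow.

```python
import logging
import time
from typing import Callable, Dict, List

logger = logging.getLogger("llmstxt_sketch")


def scrape_with_skips(
    urls: List[str],
    scrape: Callable[[str], Dict[str, str]],
    batch_size: int = 10,
    delay_seconds: float = 1.0,
) -> List[Dict[str, str]]:
    """Scrape URLs batch by batch, logging and skipping any that fail."""
    results: List[Dict[str, str]] = []
    for start in range(0, len(urls), batch_size):
        if start > 0:
            # Pause between batches to ease pressure on rate limits.
            time.sleep(delay_seconds)
        for url in urls[start:start + batch_size]:
            try:
                results.append(scrape(url))
            except Exception as exc:
                # A failed page is recorded in the log and left out of the output.
                logger.warning("Skipping %s: %s", url, exc)
    return results
```

Raising `delay_seconds` or lowering `batch_size` trades total run time for a lower request rate, which is the kind of manual adjustment the paragraph above refers to.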

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 30 days

Explore Similar Projects

Starred by Tobi Lutke (Cofounder of Shopify), Dirk Englund (MIT EECS Professor and Cofounder of Axiomatic AI), and 25 more.

firecrawl by firecrawl

1.8%
74k
API service for turning websites into LLM-ready data
Created 1 year ago
Updated 2 days ago