create-llmstxt-py by firecrawl

Python tool for generating LLM-ready website content

Created 8 months ago

277 stars

Top 93.8% on SourcePulse

2 Experts Love This Project

Eric Ciarla

Cofounder of Firecrawl

Nicolas Camara

Cofounder of Firecrawl

Summary

This Python script generates llms.txt and llms-full.txt files for any website, giving Large Language Models a standardized view of the site's content. It targets developers and researchers who need to ingest web content, and it maps, scrapes, and summarizes a site in a single run. The primary benefit is streamlined data preparation for LLMs.

How It Works

The script calls Firecrawl's /map endpoint to discover all of a website's URLs, then scrapes each page as markdown. It uses OpenAI's GPT-4o-mini model to generate a concise title (3-4 words) and description (9-10 words) for each page. URLs are processed concurrently in batches of up to 10, and the URL limit and output options are configurable to control run time and resource usage.
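
The batching step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: `summarize_page` and `fake_summarize` are hypothetical stand-ins for the scrape-plus-GPT-4o-mini step, and the `- [title](url): description` line layout is an assumed output format.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Tuple

BATCH_SIZE = 10  # the summary mentions batches of up to 10 URLs


def batched(urls: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_index(
    urls: List[str],
    summarize_page: Callable[[str], Tuple[str, str]],
    batch_size: int = BATCH_SIZE,
) -> str:
    """Summarize URLs batch by batch and return llms.txt-style text.

    `summarize_page` is a hypothetical callable returning (title, description);
    in the real tool this role is played by Firecrawl scraping plus an OpenAI call.
    """
    lines = []
    for batch in batched(urls, batch_size):
        # Pages within one batch are processed concurrently;
        # pool.map preserves the input order of the batch.
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            for url, (title, description) in zip(batch, pool.map(summarize_page, batch)):
                lines.append(f"- [{title}]({url}): {description}")
    return "\n".join(lines)


def fake_summarize(url: str) -> Tuple[str, str]:
    """Stand-in summarizer so the sketch runs without any API keys."""
    return f"Page {url.rsplit('/', 1)[-1]}", f"Placeholder description for {url}"


if __name__ == "__main__":
    demo_urls = [f"https://example.com/{i}" for i in range(12)]
    print(build_index(demo_urls, fake_summarize))
```

Passing the summarizer in as a callable keeps the batching logic testable without network access; twelve demo URLs are split into one batch of ten and one batch of two.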

Quick Start & Requirements

Installation: Clone the repository and install dependencies using pip install -r requirements.txt.
Prerequisites: Python 3.7+, a Firecrawl API key, and an OpenAI API key. API keys can be configured via .env file, environment variables, or command-line arguments.
Usage: Run the script with python generate-llmstxt.py <website_url>. Options include limiting URLs (--max-urls), specifying output directories (--output-dir), and generating only llms.txt (--no-full-text).
Links: Firecrawl API Key, OpenAI API Key.
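
As a rough illustration of how the options above might be parsed, the sketch below uses `argparse`. Only `--max-urls`, `--output-dir`, and `--no-full-text` come from the summary; the default values, the `resolve_key` helper, and the rule that an explicit value wins over the environment are assumptions, and .env loading is left out. Check the repository for the actual flag and variable names.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing only the options named in the summary above."""
    parser = argparse.ArgumentParser(prog="generate-llmstxt.py")
    parser.add_argument("url", help="website to process")
    # Default values below are assumptions, not taken from the project.
    parser.add_argument("--max-urls", type=int, default=None)
    parser.add_argument("--output-dir", default=".")
    parser.add_argument("--no-full-text", action="store_true")
    return parser


def resolve_key(
    cli_value: Optional[str],
    env_name: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Prefer an explicit value, then fall back to the environment.

    Hypothetical helper: the precedence order is an assumption,
    since the summary only lists the possible sources.
    """
    key = cli_value or environ.get(env_name)
    if not key:
        raise SystemExit(f"Missing API key: set {env_name} or pass it explicitly")
    return key


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["https://example.com", "--max-urls", "20", "--no-full-text"]
    )
    print(args.url, args.max_urls, args.output_dir, args.no_full_text)
```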

Highlighted Details

Automated website URL discovery via Firecrawl's /map endpoint.
AI-powered summarization using GPT-4o-mini for page titles and descriptions.
Parallel processing in batches (10 URLs by default) speeds up generation.
Configurable URL limit (--max-urls) and output options (--output-dir, --no-full-text).

Maintenance & Community

No specific details regarding maintainers, community channels, or roadmap are provided in the README.

Licensing & Compatibility

The project is released under the MIT License, permitting broad use, including commercial applications.

Limitations & Caveats

Processing time scales with website size and network response times. For very large sites, users may encounter rate limiting or memory issues, necessitating the use of --max-urls, --no-full-text, or manual adjustments to batch sizes and delays. Failed URL scrapes are logged and skipped.
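
The log-and-skip behavior and the batch and delay knobs mentioned above might look something like the sketch below. It is an illustration only: `scrape_all`, `flaky_scrape`, and the parameter names are hypothetical, and pages are scraped sequentially within each batch to keep the example short.

```python
import logging
import time
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("llmstxt-sketch")


def scrape_all(
    urls: List[str],
    scrape: Callable[[str], str],
    max_urls: Optional[int] = None,
    batch_size: int = 10,
    delay_seconds: float = 0.0,
) -> Dict[str, str]:
    """Scrape URLs in batches, logging and skipping any that fail."""
    if max_urls is not None:
        urls = urls[:max_urls]  # cap the work, as --max-urls would
    results: Dict[str, str] = {}
    for start in range(0, len(urls), batch_size):
        for url in urls[start:start + batch_size]:
            try:
                results[url] = scrape(url)
            except Exception as exc:  # broad on purpose: one bad page must not stop the run
                logger.warning("Skipping %s: %s", url, exc)
        if delay_seconds and start + batch_size < len(urls):
            time.sleep(delay_seconds)  # pause between batches to ease rate limits
    return results


def flaky_scrape(url: str) -> str:
    """Hypothetical scraper that fails on one URL to exercise the skip path."""
    if url.endswith("/broken"):
        raise RuntimeError("simulated scrape failure")
    return f"# Markdown for {url}"


if __name__ == "__main__":
    demo = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]
    print(list(scrape_all(demo, flaky_scrape)))
```

Lowering `batch_size` or raising `delay_seconds` trades speed for fewer requests per unit of time, which is the kind of manual adjustment the caveat above refers to.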

Health Check

Last Commit

8 months ago

Responsiveness

Inactive

Star History

9 stars in the last 30 days