firecrawl
Python tool for generating LLM-ready website content
Summary
This Python script automates the generation of llms.txt and llms-full.txt files from any website, creating a standardized format for Large Language Models. It targets developers and researchers needing to ingest web content, offering an efficient way to map, scrape, and summarize site data. The primary benefit is streamlined LLM data preparation.
How It Works
The script leverages Firecrawl's /map endpoint to discover all website URLs, then scrapes each page for markdown content. It utilizes OpenAI's GPT-4o-mini model to generate concise titles (3-4 words) and descriptions (9-10 words) for each page. Processing occurs concurrently across batches of up to 10 URLs, with configurable limits and flexible output options to optimize performance and resource usage.
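The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the Firecrawl base URL, request payload shape, and response field names are assumptions, and the page-scraping function is left pluggable.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed Firecrawl API base; the /map endpoint name follows the description above.
FIRECRAWL_API = "https://api.firecrawl.dev/v1"

def map_site(url: str, api_key: str, limit: int = 100) -> list[str]:
    """Discover a site's URLs via Firecrawl's /map endpoint (payload shape assumed)."""
    req = urllib.request.Request(
        f"{FIRECRAWL_API}/map",
        data=json.dumps({"url": url, "limit": limit}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("links", [])

def batched(urls: list[str], size: int = 10) -> list[list[str]]:
    """Split discovered URLs into batches of up to `size`, as the script does."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def scrape_concurrently(urls: list[str], scrape_one, batch_size: int = 10) -> list:
    """Process each batch with a thread pool; `scrape_one` fetches one page's markdown."""
    results = []
    for batch in batched(urls, batch_size):
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            results.extend(pool.map(scrape_one, batch))
    return results
```

Batching before handing work to the pool keeps at most ten requests in flight at once, which matches the concurrency limit the summary mentions.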
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt.
API keys can be supplied via a .env file, environment variables, or command-line arguments.
Run python generate-llmstxt.py <website_url>. Options include limiting URLs (--max-urls), specifying the output directory (--output-dir), and generating only llms.txt (--no-full-text).
Highlighted Details
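Since the script talks to both Firecrawl and OpenAI, a .env file would plausibly look like the fragment below. The variable names are assumptions (the README excerpt does not name them), and the key values are placeholders.

```shell
# .env — hypothetical key names; check the project's README for the actual ones
FIRECRAWL_API_KEY=fc-...
OPENAI_API_KEY=sk-...
```

With keys in place, a typical invocation from the options listed above would be python generate-llmstxt.py https://example.com --max-urls 50 --output-dir out.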
Uses Firecrawl's /map endpoint for fast site-wide URL discovery.
Maintenance & Community
No specific details regarding maintainers, community channels, or roadmap are provided in the README.
Licensing & Compatibility
The project is released under the MIT License, permitting broad use, including commercial applications.
Limitations & Caveats
Processing time scales with website size and network response times. For very large sites, users may encounter rate limiting or memory issues, necessitating the use of --max-urls, --no-full-text, or manual adjustments to batch sizes and delays. Failed URL scrapes are logged and skipped.
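One manual adjustment hinted at above is adding delays between retries when a scrape hits rate limits. A minimal sketch of that idea, assuming exponential backoff (the delay values and retry logic are illustrative, not taken from the project):

```python
import time

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def scrape_with_retry(url: str, scrape_one, retries: int = 4):
    """Retry a failed scrape with increasing delays; log and skip if all retries fail,
    mirroring the skip-and-log behavior described above."""
    for delay in backoff_delays(retries):
        try:
            return scrape_one(url)
        except Exception as exc:  # e.g. an HTTP 429 from rate limiting
            print(f"retrying {url} after {delay:.0f}s: {exc}")
            time.sleep(delay)
    print(f"skipping {url}: all retries failed")
    return None
```

Capping the delay keeps worst-case waits bounded even on very large sites.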
Last updated 7 months ago; the repository is marked Inactive.