llmstxt-generator  by firecrawl

Tool for generating text files for LLM training/inference

Created 11 months ago
478 stars

Top 64.0% on SourcePulse

Project Summary

This tool crawls a website and consolidates its content into text files intended for Large Language Model (LLM) training and inference. It targets developers and researchers who need to turn web content into LLM-ready data.

How It Works

The system uses FireCrawl to crawl the specified URL and extract its content, then uses GPT-4-mini to process the text and consolidate it into text files. Two outputs are generated: a standard llms.txt and a more comprehensive llms-full.txt.
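The consolidation step can be pictured with a minimal sketch. The Page fields, the function names, and the exact layout of both files are assumptions made for illustration; the README does not document the real schema or output format, and the crawl and LLM steps are left out.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One crawled page. Field names are hypothetical, not FireCrawl's schema."""
    url: str
    title: str
    description: str  # short summary, e.g. produced by the LLM step
    markdown: str     # full extracted page text


def build_llms_txt(site: str, pages: list[Page]) -> str:
    """Compact index: one link line per page (assumed layout)."""
    lines = [f"# {site} llms.txt", ""]
    lines += [f"- [{p.title}]({p.url}): {p.description}" for p in pages]
    return "\n".join(lines) + "\n"


def build_llms_full_txt(site: str, pages: list[Page]) -> str:
    """Full variant: each page's text under its own heading (assumed layout)."""
    parts = [f"# {site} llms-full.txt", ""]
    for p in pages:
        parts += [f"## {p.title}", p.url, "", p.markdown.strip(), ""]
    return "\n".join(parts)
```

Under these assumptions, llms.txt stays small enough to act as an index, while llms-full.txt carries the complete text of every page.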

Quick Start & Requirements

  • Web Interface: Visit llmstxt.firecrawl.dev for browser-based generation.
  • API Endpoint: GET https://llmstxt.firecrawl.dev/[YOUR_URL_HERE]
  • Local Development: Run npm install, then npm run dev.
  • Prerequisites: A .env file with FIRECRAWL_API_KEY, SUPABASE_URL, SUPABASE_KEY, and OPENAI_API_KEY is necessary for local setup.
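A minimal client sketch for the API endpoint, using only the Python standard library. It assumes the target site is appended to the base URL exactly as the pattern above suggests, without extra encoding, and that the response body is UTF-8 text; neither detail is confirmed by the README. The helper names and the 300-second default timeout are illustrative choices, picked because generation can take several minutes.

```python
import urllib.request

BASE_URL = "https://llmstxt.firecrawl.dev"  # host taken from the endpoint pattern


def build_request_url(target: str) -> str:
    """Append the target site to the base URL (assumed: no extra encoding)."""
    target = target.strip().lstrip("/")
    if not target:
        raise ValueError("target URL must not be empty")
    return f"{BASE_URL}/{target}"


def fetch_llms_txt(target: str, timeout: float = 300.0) -> str:
    """GET the generated text. The timeout is a guess, not a documented limit."""
    with urllib.request.urlopen(build_request_url(target), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example call (performs a real network request, so it is left commented out):
# text = fetch_llms_txt("example.com")
```

Keeping URL construction separate from the network call makes the request format easy to adjust if the service expects a different encoding.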

Highlighted Details

  • Powered by FireCrawl for web crawling and GPT-4-mini for text processing.
  • Generates both llms.txt and llms-full.txt output formats.
  • Offers both a web interface and an API endpoint.
  • No API key required for basic usage via the web interface.

Maintenance & Community

The project is associated with @firecrawl_dev. Further community or maintenance details are not provided in the README.

Licensing & Compatibility

No license is specified in the README.

Limitations & Caveats

Processing can take several minutes because each request involves crawling and LLM calls. Local development requires credentials for FireCrawl, Supabase, and OpenAI.
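Before starting a local setup, it may help to confirm that the .env file defines every required variable. The sketch below is a hypothetical preflight check, not part of the project: the four key names come from the prerequisites listed earlier, while the parser only handles plain KEY=VALUE lines and is an assumption about how the file is written.

```python
REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "SUPABASE_URL", "SUPABASE_KEY", "OPENAI_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_keys(text: str) -> list[str]:
    """Return the required keys that are absent or empty, in REQUIRED_KEYS order."""
    env = parse_env(text)
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Passing the contents of .env to missing_keys returns an empty list when all four variables are set.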

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 17 stars in the last 30 days
