llmstxt-generator by firecrawl

CLI tool for LLM training/inference text file generation

Created 10 months ago
461 stars

Top 65.7% on SourcePulse

Project Summary

llmstxt-generator crawls a website and consolidates its content into text files intended for Large Language Model (LLM) training and inference. It targets developers and researchers who need to prepare web content for LLM applications.

How It Works

The system uses Firecrawl to crawl the specified URL and extract its content, then uses GPT-4o-mini to process the extracted text and consolidate it. It produces two output files: a standard llms.txt and a more comprehensive llms-full.txt.
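The consolidation step can be pictured with a minimal sketch. The README summary does not specify the exact layout of either file, so the layout here is an assumption for illustration: one link-plus-description line per page in llms.txt, and the full page text in llms-full.txt. The `Page` type, its fields, and both function names are hypothetical and not taken from the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """One crawled page (hypothetical structure, not the project's own)."""
    url: str
    title: str
    description: str  # short summary, e.g. produced by the LLM step
    markdown: str     # full extracted page text


def build_llms_txt(site_url: str, pages: List[Page]) -> str:
    # Assumed layout: a heading, then one "- [title](url): description" line per page.
    lines = [f"# {site_url} llms.txt", ""]
    lines += [f"- [{p.title}]({p.url}): {p.description}" for p in pages]
    return "\n".join(lines) + "\n"


def build_llms_full_txt(site_url: str, pages: List[Page]) -> str:
    # Assumed layout: a heading, then each page's title, URL, and full text.
    parts = [f"# {site_url} llms-full.txt"]
    for p in pages:
        parts.append(f"## {p.title}\n{p.url}\n\n{p.markdown.strip()}")
    return "\n\n".join(parts) + "\n"
```

The point of the sketch is only the split between a compact index and a full-text dump; the real generator may order, title, or summarize pages differently.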

Quick Start & Requirements

  • Web Interface: Visit llmstxt.firecrawl.dev for browser-based generation.
  • API Endpoint: GET https://llmstxt.firecrawl.dev/[YOUR_URL_HERE]
  • Local Development: Run npm install, then npm run dev.
  • Prerequisites: A .env file with FIRECRAWL_API_KEY, SUPABASE_URL, SUPABASE_KEY, and OPENAI_API_KEY is necessary for local setup.
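As a rough sketch of calling the API endpoint listed above, the target URL can be appended to the base address and fetched with a plain GET. Several details are assumptions: the summary does not say whether the target URL must be percent-encoded (this sketch appends it verbatim, following the pattern shown), it does not describe the response format (the body is returned as raw text), and the 300-second timeout is an arbitrary value chosen because generation is reported to take several minutes. The function names are hypothetical.

```python
import urllib.request

BASE_URL = "https://llmstxt.firecrawl.dev"


def build_endpoint(target_url: str) -> str:
    # Follows the documented pattern: BASE_URL/[YOUR_URL_HERE].
    # Assumption: the target URL is appended as-is, without encoding.
    return f"{BASE_URL}/{target_url}"


def fetch_generated_text(target_url: str, timeout_s: float = 300.0) -> str:
    # timeout_s is a socket timeout; the default is an assumed value,
    # not one documented by the project.
    with urllib.request.urlopen(build_endpoint(target_url), timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_generated_text("https://example.com")` would issue the GET request; `build_endpoint` alone can be used to inspect the URL without touching the network.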

Highlighted Details

  • Powered by Firecrawl for web crawling and GPT-4o-mini for text processing.
  • Generates both llms.txt and llms-full.txt output formats.
  • Offers both a web interface and an API endpoint.
  • No API key required for basic usage via the web interface.

Maintenance & Community

The project is associated with @firecrawl_dev. Further community or maintenance details are not provided in the README.

Licensing & Compatibility

Licensing information is not specified in the README.

Limitations & Caveats

Generation can take several minutes because each request involves both crawling and LLM processing. Local development requires API keys for Firecrawl, Supabase, and OpenAI.
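Because a local run depends on all four keys being present, a quick pre-flight check can save a failed start. The sketch below assumes the values from the .env file have already been loaded into the process environment; the key names come from the prerequisites above, while the `missing_keys` helper is hypothetical and not part of the project.

```python
import os
from typing import List, Mapping, Optional

# Key names as listed in the local setup prerequisites.
REQUIRED_KEYS = (
    "FIRECRAWL_API_KEY",
    "SUPABASE_URL",
    "SUPABASE_KEY",
    "OPENAI_API_KEY",
)


def missing_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    # Returns the required keys that are absent or empty, in listed order.
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys: " + ", ".join(missing))
    else:
        print("All required keys are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise without changing the real environment.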

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Daniel Han (Cofounder of Unsloth), and 1 more.

synthetic-data-kit by meta-llama

0.8%
1k
Synthetic data CLI tool for LLM fine-tuning
Created 5 months ago
Updated 1 month ago
Starred by Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), Jeremy Howard (Cofounder of fast.ai), and 2 more.

trafilatura by adbar

0.5%
5k
Python package for web text extraction
Created 6 years ago
Updated 6 days ago