onefilellm by jimmc414

CLI tool for LLM prompt data aggregation from various sources

Created 2 years ago
1,701 stars

Top 25.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This tool aggregates and preprocesses data from diverse sources like GitHub, ArXiv, YouTube, and web pages into a single, LLM-ready text file. It's designed for researchers, developers, and power users seeking to efficiently create dense prompts by consolidating information from various online and local resources.

How It Works

OneFileLLM is a command-line utility that automatically detects the type of each input source (URLs, local paths, GitHub repos/PRs/issues, ArXiv, YouTube, Sci-Hub) and routes it to a source-specific processing module, using libraries such as BeautifulSoup for web scraping, PyPDF2 for PDFs, and the YouTube Transcript API for video transcripts. The aggregated text undergoes preprocessing (stopword removal, lowercasing) and is emitted in an XML-encapsulated format; token counts are reported, and the uncompressed text is copied to the clipboard.
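The automatic source-type detection described above can be sketched as a small dispatcher that pattern-matches the input string. The function name and URL patterns below are illustrative assumptions, not the tool's actual code:

```python
import re
from pathlib import Path

def detect_source_type(source: str) -> str:
    """Guess the input type the way such a tool might (illustrative only)."""
    if re.match(r"https?://github\.com/[^/]+/[^/]+/pull/\d+", source):
        return "github_pr"
    if re.match(r"https?://github\.com/[^/]+/[^/]+/issues/\d+", source):
        return "github_issue"
    if re.match(r"https?://github\.com/", source):
        return "github_repo"
    if "arxiv.org" in source:
        return "arxiv"
    if "youtube.com" in source or "youtu.be" in source:
        return "youtube"
    if source.startswith(("http://", "https://")):
        return "webpage"
    if Path(source).exists():
        return "local_path"
    return "unknown"
```

Each detected type would then map to its own processing module (scraper, PDF extractor, transcript fetcher, and so on).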

Quick Start & Requirements

  • Install dependencies: pip install -U -r requirements.txt
  • Optional: Create and activate a virtual environment (python -m venv .venv, source .venv/bin/activate).
  • For private GitHub repos, set a GITHUB_TOKEN environment variable or replace the placeholder in the script.
  • Usage: python onefilellm.py <URL_or_Path>
  • See: GitHub Repository
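For the GITHUB_TOKEN step, reading the token from the environment and attaching it to GitHub API requests typically looks like the sketch below. The `token` Authorization scheme is GitHub's standard one, but the surrounding function is an assumption, not the script's actual implementation:

```python
import os
import urllib.request

def github_request(url: str) -> urllib.request.Request:
    """Build a GitHub API request, adding auth only if GITHUB_TOKEN is set."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # Private repositories require an authenticated request.
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers)
```

Using an environment variable keeps the token out of the source tree, which matters if you later aggregate the script's own repository into a prompt.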

Highlighted Details

  • Supports aggregation from local files/directories, GitHub repos, PRs, issues, ArXiv papers, YouTube transcripts, and Sci-Hub papers via DOI/PMID.
  • Web crawling functionality extracts content from linked pages up to a configurable depth.
  • Output is automatically copied to the clipboard and includes token count reporting.
  • Recent updates include file/directory exclusion and XML encapsulation for improved LLM performance.
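The depth-limited crawl mentioned above amounts to a breadth-first traversal that stops following links beyond a configured depth. The sketch below substitutes an in-memory link graph for real HTTP fetching and HTML parsing (the tool itself uses BeautifulSoup); all names are illustrative:

```python
from collections import deque

def crawl(start: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Breadth-first crawl: visit pages up to max_depth hops from start."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order
```

With `max_depth=1`, only the start page and the pages it links to directly are collected; each extra level of depth adds one more hop of linked pages.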

Maintenance & Community

The project is actively developed, with recent updates in July 2024 and January 2025. Configuration details for file-type inclusion/exclusion and web-crawl depth are documented in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

The README does not specify a license, which may impact commercial adoption. The tool relies on external services like Sci-Hub, which may have availability issues. Configuration for file type filtering and web crawl depth requires direct code modification.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 20
  • Issues (30d): 3
  • Star History: 59 stars in the last 30 days

Starred by Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), Jeremy Howard (Cofounder of fast.ai), and 2 more.

Explore Similar Projects

trafilatura by adbar (Top 0.5%, 5k stars)
Python package for web text extraction
Created 6 years ago; updated 6 days ago