CLI tool for LLM prompt data aggregation from various sources
Top 27.6% on sourcepulse
This tool aggregates and preprocesses data from diverse sources like GitHub, ArXiv, YouTube, and web pages into a single, LLM-ready text file. It's designed for researchers, developers, and power users seeking to efficiently create dense prompts by consolidating information from various online and local resources.
How It Works
OneFileLLM functions as a command-line utility that automatically detects input source types (URLs, local paths, GitHub repos/PRs/issues, ArXiv, YouTube, Sci-Hub). It then employs a specific processing module for each source, leveraging libraries like BeautifulSoup for web scraping, PyPDF2 for PDFs, and the YouTube Transcript API for video transcripts. The aggregated text undergoes preprocessing (stopword removal, lowercasing) and is output in an XML-encapsulated format; token counts are reported and the uncompressed text is copied to the clipboard.
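To illustrate the kind of dispatch described above, here is a minimal, hypothetical sketch of source-type detection. The function name, category labels, and matching rules are assumptions for illustration, not OneFileLLM's actual code:

```python
from urllib.parse import urlparse

def detect_source(target: str) -> str:
    """Classify an input string the way a multi-source aggregator might.

    Hypothetical sketch: the categories and rules here are illustrative only.
    """
    parsed = urlparse(target)
    host = parsed.netloc.lower()
    if "github.com" in host:
        # Could be split further into repo / PR / issue by path segments.
        return "github"
    if "arxiv.org" in host:
        return "arxiv"
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube"
    if parsed.scheme in ("http", "https"):
        return "webpage"
    # Anything that is not a URL is treated as a local path.
    return "local_path"
```

The real tool would then route each category to its own processing module (scraper, PDF extractor, transcript fetcher, and so on).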
Quick Start & Requirements
python -m venv .venv
source .venv/bin/activate
pip install -U -r requirements.txt

Set the GITHUB_TOKEN environment variable or replace the placeholder in the script, then run:

python onefilellm.py <URL_or_Path>
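The aggregated text is wrapped in XML before output. A minimal sketch of what such encapsulation might look like (the tag and attribute names are assumptions, not the tool's actual schema):

```python
from xml.sax.saxutils import escape

def wrap_sources(chunks: dict) -> str:
    """Wrap each source's text in an XML element under a single root.

    Hypothetical sketch of XML encapsulation; the real tag names may differ.
    """
    parts = ["<sources>"]
    for origin, text in chunks.items():
        origin_attr = escape(origin, {'"': "&quot;"})
        parts.append(f'<source origin="{origin_attr}">')
        # Escape body text so <, >, & in scraped content stay well-formed.
        parts.append(escape(text))
        parts.append("</source>")
    parts.append("</sources>")
    return "\n".join(parts)
```

A single well-formed XML document like this lets an LLM distinguish which source each passage came from.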
Maintenance & Community
The project shows active development with recent updates in January 2025 and July 2024. Configuration details for file type inclusion/exclusion and web crawl depth are available in the README.
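The README describes file-type inclusion/exclusion as in-script configuration. A hypothetical sketch of that kind of filter (the list contents and function name are assumptions, not OneFileLLM's actual code):

```python
# Hypothetical in-script configuration, edited directly as the README describes.
ALLOWED_EXTENSIONS = {".py", ".md", ".txt", ".rst"}
EXCLUDED_DIRS = {".git", "node_modules", "__pycache__"}

def should_include(path: str) -> bool:
    """Return True if a file passes the extension and directory filters."""
    parts = path.replace("\\", "/").split("/")
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return False
    return any(parts[-1].endswith(ext) for ext in ALLOWED_EXTENSIONS)
```

Adjusting which files end up in the aggregate is then a matter of editing these sets before running the script.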
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.
Limitations & Caveats
The README does not specify a license, which may impact commercial adoption. The tool relies on external services like Sci-Hub, which may have availability issues. Configuration for file type filtering and web crawl depth requires direct code modification.