onefilellm by jimmc414

CLI tool for LLM prompt data aggregation from various sources

created 2 years ago
1,530 stars

Top 27.6% on sourcepulse

Project Summary

This tool aggregates and preprocesses data from diverse sources like GitHub, ArXiv, YouTube, and web pages into a single, LLM-ready text file. It's designed for researchers, developers, and power users seeking to efficiently create dense prompts by consolidating information from various online and local resources.

How It Works

OneFileLLM is a command-line utility that automatically detects the type of each input (URLs, local paths, GitHub repos/PRs/issues, ArXiv papers, YouTube videos, Sci-Hub identifiers). It then routes each source to a dedicated processing module, using libraries such as BeautifulSoup for web scraping, PyPDF2 for PDFs, and the YouTube Transcript API for video transcripts. The aggregated text undergoes preprocessing (stopword removal, lowercasing) and is output in an XML-encapsulated format, with token counts reported and the uncompressed text copied to the clipboard.
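
As an illustration of the dispatch step, here is a minimal sketch of source-type detection; the function name and URL patterns are hypothetical, not taken from onefilellm's actual source:

    # Hypothetical sketch of URL/path dispatch; names and patterns are
    # illustrative, not onefilellm's real code.
    import os
    import re

    def detect_source_type(source: str) -> str:
        """Classify an input string so the right processing module can run."""
        if os.path.exists(source):
            return "local"
        if re.match(r"https?://github\.com/[^/]+/[^/]+/pull/\d+", source):
            return "github_pr"
        if re.match(r"https?://github\.com/[^/]+/[^/]+/issues/\d+", source):
            return "github_issue"
        if re.match(r"https?://github\.com/", source):
            return "github_repo"
        if re.match(r"https?://arxiv\.org/abs/", source):
            return "arxiv"
        if re.match(r"https?://(www\.)?youtube\.com/watch|https?://youtu\.be/", source):
            return "youtube"
        if re.match(r"10\.\d{4,9}/\S+", source):  # bare DOI -> Sci-Hub lookup
            return "scihub"
        return "webpage"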

Quick Start & Requirements

  • Optional: create and activate a virtual environment first (python -m venv .venv, then source .venv/bin/activate).
  • Install dependencies: pip install -U -r requirements.txt
  • For private GitHub repositories, set a GITHUB_TOKEN environment variable or replace the placeholder token in the script (see the sketch after this list).
  • Usage: python onefilellm.py <URL_or_Path>
  • See: GitHub Repository
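
A minimal sketch of the token lookup described above, assuming a hypothetical placeholder constant; the names are illustrative, not the script's actual code:

    # Hypothetical sketch: read GITHUB_TOKEN from the environment, falling
    # back to an in-script placeholder as the README describes.
    import os

    PLACEHOLDER_TOKEN = "YOUR_GITHUB_TOKEN"  # illustrative placeholder name

    def get_github_token() -> str:
        token = os.environ.get("GITHUB_TOKEN", PLACEHOLDER_TOKEN)
        if token == PLACEHOLDER_TOKEN:
            print("Warning: using placeholder token; private repos will fail.")
        return token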

Highlighted Details

  • Supports aggregation from local files/directories, GitHub repos, PRs, issues, ArXiv papers, YouTube transcripts, and Sci-Hub papers via DOI/PMID.
  • Web crawling functionality extracts content from linked pages up to a configurable depth (see the sketch after this list).
  • Output is automatically copied to the clipboard and includes token count reporting.
  • Recent updates include file/directory exclusion and XML encapsulation for improved LLM performance.
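
To illustrate the depth-limited crawl, here is a rough sketch using requests and BeautifulSoup (the latter is named in the project's stack); the function and its defaults are assumptions, not the project's actual implementation:

    # Illustrative depth-limited, breadth-first crawler; not onefilellm's
    # real crawl code.
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url: str, max_depth: int = 2) -> dict[str, str]:
        """Collect page text from start_url up to max_depth link hops away."""
        seen, pages = {start_url}, {}
        frontier = [(start_url, 0)]
        while frontier:
            url, depth = frontier.pop(0)
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
            except requests.RequestException:
                continue  # skip unreachable or failing pages
            soup = BeautifulSoup(resp.text, "html.parser")
            pages[url] = soup.get_text(separator=" ", strip=True)
            if depth < max_depth:
                for a in soup.find_all("a", href=True):
                    link = urljoin(url, a["href"])
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        frontier.append((link, depth + 1))
        return pages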

Maintenance & Community

Development is active, with updates as recent as January 2025 following earlier work in July 2024. Configuration details for file type inclusion/exclusion and web crawl depth are documented in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

The README does not specify a license, which may impede commercial adoption. The tool depends on external services such as Sci-Hub, whose availability can be unreliable. Configuring file type filters and web crawl depth requires editing the code directly.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull requests (30d): 3
  • Issues (30d): 2
  • Star history: 287 stars in the last 90 days
