onefilellm by jimmc414

CLI tool for LLM prompt data aggregation from various sources

Created 2 years ago
1,701 stars

Top 25.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This tool aggregates and preprocesses data from diverse sources like GitHub, ArXiv, YouTube, and web pages into a single, LLM-ready text file. It's designed for researchers, developers, and power users seeking to efficiently create dense prompts by consolidating information from various online and local resources.

How It Works

OneFileLLM is a command-line utility that automatically detects the type of each input source (URLs, local paths, GitHub repos/PRs/issues, ArXiv, YouTube, Sci-Hub) and routes it to a source-specific processing module, using libraries such as BeautifulSoup for web scraping, PyPDF2 for PDFs, and the YouTube Transcript API for video transcripts. The aggregated text undergoes preprocessing (stopword removal, lowercasing) and is emitted in an XML-encapsulated format; token counts are reported, and the uncompressed text is copied to the clipboard.
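The automatic source-type detection described above can be sketched as a small dispatcher that pattern-matches the input string. The function name and URL patterns below are illustrative assumptions, not the tool's actual code:

```python
import re
from pathlib import Path

def detect_source_type(source: str) -> str:
    """Guess the input type the way such a tool might (illustrative only)."""
    if re.match(r"https?://github\.com/[^/]+/[^/]+/pull/\d+", source):
        return "github_pr"
    if re.match(r"https?://github\.com/[^/]+/[^/]+/issues/\d+", source):
        return "github_issue"
    if re.match(r"https?://github\.com/", source):
        return "github_repo"
    if "arxiv.org" in source:
        return "arxiv"
    if "youtube.com" in source or "youtu.be" in source:
        return "youtube"
    if source.startswith(("http://", "https://")):
        return "webpage"
    if Path(source).exists():
        return "local_path"
    return "unknown"
```

Each detected type would then map to its own processing module (scraper, PDF extractor, transcript fetcher, and so on).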

Quick Start & Requirements

  • Install dependencies: pip install -U -r requirements.txt
  • Optional: Create and activate a virtual environment (python -m venv .venv, source .venv/bin/activate).
  • For private GitHub repos, set a GITHUB_TOKEN environment variable or replace the placeholder in the script.
  • Usage: python onefilellm.py <URL_or_Path>
  • See: GitHub Repository
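For the GITHUB_TOKEN step, reading the token from the environment and attaching it to GitHub API requests typically looks like the sketch below. The `token` Authorization scheme is GitHub's standard one, but the surrounding function is an assumption, not the script's actual implementation:

```python
import os
import urllib.request

def github_request(url: str) -> urllib.request.Request:
    """Build a GitHub API request, adding auth only if GITHUB_TOKEN is set."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        # Private repositories require an authenticated request.
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers)
```

Using an environment variable keeps the token out of the source tree, which matters if you later aggregate the script's own repository into a prompt.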

Highlighted Details

  • Supports aggregation from local files/directories, GitHub repos, PRs, issues, ArXiv papers, YouTube transcripts, and Sci-Hub papers via DOI/PMID.
  • Web crawling functionality extracts content from linked pages up to a configurable depth.
  • Output is automatically copied to the clipboard and includes token count reporting.
  • Recent updates include file/directory exclusion and XML encapsulation for improved LLM performance.
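The depth-limited crawl mentioned above amounts to a breadth-first traversal that stops following links beyond a configured depth. The sketch below substitutes an in-memory link graph for real HTTP fetching and HTML parsing (the tool itself uses BeautifulSoup); all names are illustrative:

```python
from collections import deque

def crawl(start: str, links: dict[str, list[str]], max_depth: int) -> list[str]:
    """Breadth-first crawl: visit pages up to max_depth hops from start."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, depth + 1))
    return order
```

With `max_depth=1`, only the start page and the pages it links to directly are collected; each extra level of depth adds one more hop of linked pages.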

Maintenance & Community

The project is actively developed, with recent updates in July 2024 and January 2025. Configuration details for file-type inclusion/exclusion and web-crawl depth are documented in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

The README does not specify a license, which may impact commercial adoption. The tool relies on external services like Sci-Hub, which may have availability issues. Configuration for file type filtering and web crawl depth requires direct code modification.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 20
  • Issues (30d): 3
  • Star History: 59 stars in the last 30 days

Starred by Li Jiang (Coauthor of AutoGen; Engineer at Microsoft), Jeremy Howard (Cofounder of fast.ai), and 2 more.

Explore Similar Projects

trafilatura by adbar (Top 0.5%, 5k stars)
Python package for web text extraction
Created 6 years ago; updated 6 days ago