ingest  by sammcj

Markdown generator for LLM ingestion

Created 1 year ago
298 stars

Top 89.2% on SourcePulse

Project Summary

This tool parses files and websites into a single markdown file, or sends the result directly to an LLM. It targets developers and researchers preparing data for AI models, and it reduces manual preparation work through code compression, VRAM estimation, and LLM integration.

How It Works

Ingest traverses directory structures and can optionally compress code using Tree-sitter, retaining structural information while omitting implementation details. It tokenizes the content, then either sends it to an LLM through an OpenAI-compatible API (such as Ollama) or saves the output to a file. It also estimates VRAM usage and checks model compatibility, relying on a separate package, to help users determine whether their data fits within a given model's constraints.

Quick Start & Requirements

  • Install: go install github.com/sammcj/ingest@HEAD (recommended) or via a provided curl script.
  • Prerequisites: a working Go installation. The tool downloads the cl100k_base.tiktoken tokenizer file on first run.
  • Docs: https://github.com/sammcj/ingest

Highlighted Details

  • Code compression using Tree-sitter for Go, Python, JavaScript, Bash, C, and CSS.
  • VRAM estimation and model compatibility checks for GGUF and ExLlamaV2 models.
  • Direct LLM integration with Ollama and OpenAI-compatible APIs.
  • Web crawling capabilities with domain restrictions and depth control.
  • Git diff and log inclusion for version-controlled projects.

Maintenance & Community

  • Project maintained by Sam McLeod.
  • Contributions are welcome via Pull Requests.

Licensing & Compatibility

  • MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

The Tree-sitter compression is experimental and currently supports a limited set of languages. The README notes that version printing (-V) is a work-in-progress.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 7 stars in the last 30 days
