ingest by sammcj

Markdown generator for LLM ingestion

Created 1 year ago

355 stars

Top 78.8% on SourcePulse

Project Summary

Ingest converts local files and websites into a single Markdown document, which it can save to a file or send directly to an LLM. It targets developers and researchers preparing data for AI models. Features such as code compression, VRAM estimation, and LLM integration reduce manual preparation work and help check that the output fits a given model.

How It Works

Ingest traverses directory structures and can optionally compress code with Tree-sitter, retaining structural information while omitting implementation details. It tokenizes the content and can either send it to an LLM through an OpenAI-compatible API (such as Ollama's) or save the output to a file. It also estimates VRAM use and checks model compatibility, relying on a separate package, to help users determine whether their data fits within a given model's constraints.

Quick Start & Requirements

Install: go install github.com/sammcj/ingest@HEAD (recommended) or via a provided curl script.
Prerequisites: a Go installation. Ingest downloads the cl100k_base.tiktoken tokenizer on first run.
Docs: https://github.com/sammcj/ingest

Highlighted Details

Code compression using Tree-sitter for Go, Python, JavaScript, Bash, C, and CSS.
VRAM estimation and model compatibility checks for GGUF and ExLlamaV2 models.
Direct LLM integration with Ollama and OpenAI-compatible APIs.
Web crawling capabilities with domain restrictions and depth control.
Git diff and log inclusion for version-controlled projects.

Maintenance & Community

Project maintained by Sam McLeod.
Contributions are welcome via Pull Requests.

Licensing & Compatibility

MIT License. Permissive for commercial use and closed-source linking.

Limitations & Caveats

Tree-sitter compression is experimental and currently supports only a limited set of languages. The README notes that version printing (-V) is a work in progress.


Explore Similar Projects

glimpse by seatedro

orca by santiagomed

codefetch by regenrek

CodeWeaver by tesserato

code2prompt by raphaelmansuy

vision-parse by iamarunbrahma

llmstxt-generator by firecrawl

e2m by wisupai

chunkr by lumina-ai-inc

synthetic-data-kit by meta-llama

shotgun_code by glebkudr

repomix by yamadashy