chunkr by lumina-ai-inc

Document intelligence API for RAG/LLM workflows

created 11 months ago
2,321 stars

Top 20.1% on sourcepulse

Project Summary

Chunkr is an open-source document intelligence API designed to transform complex documents into RAG/LLM-ready data. It targets developers and researchers needing to process PDFs, PPTs, Word docs, and images for AI applications, offering layout analysis, OCR, and semantic chunking.

How It Works

Chunkr leverages a multi-stage pipeline that includes document parsing, optical character recognition (OCR) with bounding boxes, and layout analysis to generate structured HTML and Markdown outputs. It supports Vision-Language Models (VLMs) for enhanced understanding and offers flexible LLM integration via configuration files or environment variables, enabling users to select and manage various LLM providers.
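To make the last stages of that pipeline concrete, here is a minimal sketch of how layout segments with bounding boxes might be grouped into chunks and rendered as Markdown. It is not Chunkr's implementation: the Segment and Chunk classes, the "heading"/"text" labels, the chunk_segments function, and the target_words parameter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Segment:
    """One layout element, as OCR plus layout analysis might emit it (hypothetical shape)."""
    kind: str                                   # assumed labels: "heading" or "text"
    text: str
    bbox: Tuple[float, float, float, float]     # (left, top, width, height)
    page: int = 1


@dataclass
class Chunk:
    """A group of consecutive segments intended to be embedded together."""
    segments: List[Segment] = field(default_factory=list)

    @property
    def word_count(self) -> int:
        return sum(len(s.text.split()) for s in self.segments)

    def to_markdown(self) -> str:
        # Headings become Markdown headings; everything else is a paragraph.
        parts = [f"## {s.text}" if s.kind == "heading" else s.text
                 for s in self.segments]
        return "\n\n".join(parts)


def chunk_segments(segments: List[Segment], target_words: int = 50) -> List[Chunk]:
    """Close the current chunk at each heading or when the word budget would be exceeded."""
    chunks: List[Chunk] = []
    current = Chunk()
    for seg in segments:
        starts_section = seg.kind == "heading"
        overflows = current.word_count + len(seg.text.split()) > target_words
        if current.segments and (starts_section or overflows):
            chunks.append(current)
            current = Chunk()
        current.segments.append(seg)
    if current.segments:
        chunks.append(current)
    return chunks
```

Splitting at headings keeps each chunk inside one section, while the word budget bounds chunk size for embedding. A real pipeline would likely count tokens rather than words.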

Quick Start & Requirements

  • Install (Python client): pip install chunkr-ai
  • Prerequisites: Docker and Docker Compose for self-hosting. NVIDIA Container Toolkit is recommended for GPU support.
  • Self-Hosted Deployment: Requires cloning the repository, setting up .env and models.yaml files, and running via docker compose up -d (with variations for CPU and Mac ARM).
  • Documentation: chunkr.ai

Highlighted Details

  • Supports multiple document formats (PDF, PPT, Word, images).
  • Provides structured output options: HTML, Markdown, plain text, JSON.
  • Offers self-hosted deployment via Docker Compose and Kubernetes (Helm chart available).
  • Flexible LLM configuration supporting multiple providers and rate limiting.
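As an illustration of the last point, the sketch below shows one way per-provider configuration with an environment-variable override and a per-minute rate limit could be modeled. It does not reflect Chunkr's actual models.yaml schema or environment variable names: ModelConfig, select_model, TokenBucket, and EXAMPLE_LLM_MODEL_ID are hypothetical and exist only in this example.

```python
import os
import time
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Optional


@dataclass
class ModelConfig:
    """Hypothetical entry describing one LLM provider/model."""
    id: str
    provider_url: str
    rate_limit_per_min: Optional[int] = None


def select_model(models: Dict[str, ModelConfig],
                 default_id: str,
                 env: Mapping[str, str] = os.environ) -> ModelConfig:
    """Pick a model by id; the (assumed) EXAMPLE_LLM_MODEL_ID variable overrides the default."""
    chosen = env.get("EXAMPLE_LLM_MODEL_ID", default_id)
    if chosen not in models:
        raise KeyError(f"unknown model id: {chosen}")
    return models[chosen]


class TokenBucket:
    """Simple token bucket: allows up to `per_minute` requests per minute."""

    def __init__(self, per_minute: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.per_minute = per_minute
        self.capacity = float(per_minute)
        self.tokens = float(per_minute)   # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Return True and consume a token if one is available, else False."""
        now = self.clock()
        refill = (now - self.last) * self.per_minute / 60.0
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check try_acquire() before each request to the selected provider and back off when it returns False. The clock is injectable so the limiter can be tested without sleeping.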

Maintenance & Community

Licensing & Compatibility

  • License: Dual-licensed: GNU Affero General Public License v3.0 (AGPL-3.0) and a Commercial License.
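  • Compatibility: AGPL-3.0 requires that derivative works, including modified versions offered to users over a network, be released under the same license. Organizations that cannot meet those terms need the separate Commercial License.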

Limitations & Caveats

The AGPL-3.0 license obliges anyone who modifies the software and distributes it, or offers the modified version as a network service, to publish the corresponding source code under the same license, which may extend to their own code that is combined with it. Specific VLM processing controls are mentioned but not detailed in the README.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 15
  • Issues (30d): 3

Star History

324 stars in the last 90 days
