Document intelligence API for RAG/LLM workflows
Chunkr is an open-source document intelligence API designed to transform complex documents into RAG/LLM-ready data. It targets developers and researchers needing to process PDFs, PPTs, Word docs, and images for AI applications, offering layout analysis, OCR, and semantic chunking.
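To make "semantic chunking" concrete, here is a minimal sketch. It is not Chunkr's actual algorithm: it groups layout segments into chunks, starting a new chunk at each heading or when a word budget would be exceeded. The `Segment` type, its `kind` labels, and `chunk_segments` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical layout segment: a label from layout analysis plus its text."""
    kind: str  # illustrative labels such as "heading" or "text"
    text: str


def chunk_segments(segments: List[Segment], max_words: int = 50) -> List[List[Segment]]:
    """Group segments into chunks, breaking at headings and at the word budget."""
    chunks: List[List[Segment]] = []
    current: List[Segment] = []
    words = 0
    for seg in segments:
        n = len(seg.text.split())
        # Close the open chunk at a heading, or when this segment would
        # push the chunk past the word budget.
        if current and (seg.kind == "heading" or words + n > max_words):
            chunks.append(current)
            current, words = [], 0
        current.append(seg)
        words += n
    if current:
        chunks.append(current)
    return chunks
```

Breaking at headings keeps each chunk aligned with one section of the document, which tends to matter more for retrieval quality than hitting an exact size.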
How It Works
Chunkr leverages a multi-stage pipeline that includes document parsing, optical character recognition (OCR) with bounding boxes, and layout analysis to generate structured HTML and Markdown outputs. It supports Vision-Language Models (VLMs) for enhanced understanding and offers flexible LLM integration via configuration files or environment variables, enabling users to select and manage various LLM providers.
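The choice between configuration files and environment variables can be pictured with a small sketch. Everything named here is an assumption: the `LLM_PROVIDER`, `LLM_MODEL`, and `LLM_API_KEY` variable names, the `ModelConfig` fields, and `resolve_model_config` are hypothetical and are not taken from Chunkr's own configuration schema. The sketch only shows one common precedence rule, in which environment variables override file defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical settings for one LLM provider (not Chunkr's real schema)."""
    provider: str
    model: str
    api_key: Optional[str] = None


def resolve_model_config(
    file_defaults: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> ModelConfig:
    """Merge file defaults with environment overrides; the environment wins."""
    env = os.environ if env is None else env

    def pick(env_name: str, file_key: str) -> Optional[str]:
        # Prefer a non-empty environment variable, else fall back to the file value.
        return env.get(env_name) or file_defaults.get(file_key)

    provider = pick("LLM_PROVIDER", "provider")
    model = pick("LLM_MODEL", "model")
    if not provider or not model:
        raise ValueError("both a provider and a model must be configured")
    return ModelConfig(provider, model, pick("LLM_API_KEY", "api_key"))


if __name__ == "__main__":
    # Values a parsed config file might contain (made up for illustration).
    defaults = {"provider": "example-provider", "model": "example-model"}
    print(resolve_model_config(defaults, env={"LLM_MODEL": "another-model"}))
```

Passing `env` explicitly keeps the function easy to test; in normal use it would read from the process environment.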
Quick Start & Requirements
Install the Python SDK with pip install chunkr-ai.
Self-hosted deployment involves configuring the .env and models.yaml files, then running docker compose up -d (with variations for CPU and Mac ARM).
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The AGPL-3.0 license imposes significant obligations on users who modify or distribute the software, potentially requiring them to open-source their own code. Specific VLM processing controls are mentioned but not detailed in the README.