Visual RAG system for enterprise document understanding
LAYRA is a visual-first Retrieval-Augmented Generation (RAG) system designed to understand documents holistically, preserving layout, semantics, and graphical elements. It targets researchers and enterprises needing to bridge unstructured document understanding with multimodal AI, offering a next-generation solution beyond traditional OCR-based RAG.
How It Works
LAYRA processes documents using pure visual embeddings, treating each page as a visual artifact rather than a sequence of tokens. This approach, powered by the ColPali project and its ColQwen2.5 model, captures layout structure, tabular integrity, and embedded visuals such as plots and diagrams. The visual embeddings are stored in Milvus for efficient retrieval, enabling layout-aware question answering. The system uses an async-first FastAPI backend and supports multimodal LLMs such as Qwen2.5-VL, with GPT-4o and Claude support planned.
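To make the retrieval step concrete, here is a minimal sketch of the late-interaction (MaxSim) scoring that ColPali-style multi-vector models use: each page is represented by many patch vectors, and a query scores a page by summing, per query token, the best-matching patch. The mock embeddings and page IDs below are illustrative stand-ins, not LAYRA's actual data or API; in the real system the vectors come from ColQwen2.5 and live in Milvus.

```python
# Sketch of ColPali-style late-interaction (MaxSim) retrieval over
# page-level multi-vector embeddings. Plain Python lists stand in for
# the real ColQwen2.5 embeddings and the Milvus vector store.

def maxsim_score(query_vecs, page_vecs):
    """Sum over query tokens of the best-matching page patch (dot product)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def retrieve(query_vecs, pages, top_k=1):
    """Rank pages (id -> list of patch vectors) by MaxSim score, descending."""
    scored = [(maxsim_score(query_vecs, vecs), page_id)
              for page_id, vecs in pages.items()]
    scored.sort(reverse=True)
    return [page_id for _, page_id in scored[:top_k]]

# Mock data: two "pages", each a bag of patch vectors; a two-token query.
pages = {
    "page_1": [[1.0, 0.0], [0.0, 1.0]],
    "page_2": [[0.5, 0.5], [0.2, 0.1]],
}
query = [[1.0, 0.0], [0.0, 1.0]]

print(retrieve(query, pages))  # page_1 matches both query tokens best
```

Because each query token is matched independently against all patches, this scoring is what lets a layout element (a table cell, a chart label) answer a query even when it sits far from the surrounding text.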
Quick Start & Requirements
Setup involves the following steps:

1. Configure environment files (.env, .env.local, gunicorn_config.py).
2. Launch dependencies via Docker Compose (milvus-standalone-docker-compose.yml, docker-compose.yml).
3. Install Python 3.10.6 and system dependencies (poppler-utils).
4. Install Python dependencies (pip install -r requirements.txt).
5. Download ColQwen2.5 model weights.
6. Initialize MySQL.
7. Start the backend (gunicorn) and the embedding model server (python model_server.py).

Frontend development requires npm install and npm run dev (or build/start).
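The steps above can be condensed into a rough shell sketch. This is a hedged outline, not a verified script: the Gunicorn entry point (main:app) and the apt-based install are assumptions; adjust file paths, config values, and package manager for your environment.

```shell
# Launch vector store and service dependencies via Docker Compose
docker compose -f milvus-standalone-docker-compose.yml up -d
docker compose -f docker-compose.yml up -d

# System and Python dependencies (Python 3.10.6 assumed already installed)
sudo apt-get install -y poppler-utils   # assumes a Debian/Ubuntu host
pip install -r requirements.txt

# Start the backend and the embedding model server
gunicorn -c gunicorn_config.py main:app   # "main:app" is an assumed entry point
python model_server.py

# Frontend development server
npm install && npm run dev
```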
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Last updated 6 days ago; the project is currently marked inactive.