layra  by liweiphys

Visual RAG system for enterprise document understanding

Created 5 months ago
808 stars

Top 43.8% on SourcePulse

Project Summary

LAYRA is a visual-first Retrieval-Augmented Generation (RAG) system that reads documents as whole pages, preserving layout, semantics, and graphical elements. It targets researchers and enterprises who need to connect unstructured document understanding with multimodal AI, and it positions itself as an alternative to traditional OCR-based RAG.

How It Works

LAYRA embeds documents purely visually, treating each page as a visual artifact rather than a sequence of tokens. This approach, powered by the ColPali project and its ColQwen2.5 model, captures layout structure, tabular integrity, and embedded visuals such as plots and diagrams. The embeddings are stored in Milvus for retrieval, enabling layout-aware question answering. The backend is async-first, built on FastAPI, and supports multimodal LLMs such as Qwen2.5-VL; support for GPT-4o, Claude, and Gemini is planned.
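To make the retrieval idea concrete, here is a minimal sketch of late-interaction (MaxSim) scoring, the scheme ColPali-family models are generally described as using: each query token vector is matched against its best page patch vector, and the matches are summed. Whether LAYRA scores pages exactly this way is an assumption; the function names (`maxsim_score`, `rank_pages`) and the toy two-dimensional vectors are hypothetical, and a real deployment would delegate the search to Milvus rather than loop in Python.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for every query vector take the best-matching
    page patch vector, then sum those maxima."""
    if not query_vecs or not page_vecs:
        return 0.0
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: Sequence[Vector], pages: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Return (page_id, score) pairs sorted from best to worst match."""
    scored = [(page_id, maxsim_score(query_vecs, vecs)) for page_id, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data only: real page embeddings have many patch vectors per page.
    query = [[1.0, 0.0], [0.0, 1.0]]
    pages = {
        "page-1": [[1.0, 0.0], [0.5, 0.5]],
        "page-2": [[0.0, 0.2], [0.1, 0.0]],
    }
    for page_id, score in rank_pages(query, pages):
        print(page_id, round(score, 3))
```

Because every page keeps many patch-level vectors instead of one pooled vector, a query about a table cell or a plot label can match the specific region of the page where it appears.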

Quick Start & Requirements

  • Install/Run:
      1. Clone the repository.
      2. Set up environment variables (.env, .env.local, gunicorn_config.py).
      3. Launch dependencies via Docker Compose (milvus-standalone-docker-compose.yml, docker-compose.yml).
      4. Install Python 3.10.6.
      5. Install system dependencies (poppler-utils).
      6. Install Python dependencies (pip install -r requirements.txt).
      7. Download the ColQwen2.5 model weights.
      8. Initialize MySQL.
      9. Start the backend (gunicorn).
      10. Start the embedding model server (python model_server.py).
      11. For frontend development, run npm install and npm run dev (or build/start).
  • Prerequisites: Python 3.10.6, Git LFS, poppler-utils, MySQL; Milvus, Redis, MongoDB, Kafka, and MinIO run via Docker Compose.
  • Setup Time: Significant setup involving cloning, environment configuration, Docker Compose setup, model downloads, and database initialization.
  • Links: GitHub repo at github.com/liweiphys/layra
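Since the setup touches many tools, a small preflight check can catch missing prerequisites before the longer steps begin. The sketch below is not part of LAYRA; the tool list is an assumption derived from the prerequisites above (`pdftoppm` is one of the binaries shipped with poppler-utils, and `git-lfs` is the Git LFS executable), and the helper names are hypothetical.

```python
import shutil
import sys
from typing import Callable, Dict, Optional, Sequence

# Assumed mapping of command-line tools to the setup step that needs them.
REQUIRED_TOOLS: Dict[str, str] = {
    "git": "cloning the repository",
    "git-lfs": "fetching large model files",
    "docker": "running the Docker Compose services",
    "pdftoppm": "PDF handling (part of poppler-utils)",
    "npm": "frontend development",
}


def check_tools(
    tools: Dict[str, str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, bool]:
    """Map each tool name to True if an executable with that name is on PATH."""
    return {name: which(name) is not None for name in tools}


def python_version_ok(
    version: Optional[Sequence[int]] = None, expected: Sequence[int] = (3, 10)
) -> bool:
    """Compare major.minor of the interpreter against the expected series."""
    current = sys.version_info if version is None else version
    return tuple(current[:2]) == tuple(expected)


if __name__ == "__main__":
    for name, found in check_tools().items():
        status = "ok" if found else "MISSING"
        print(f"{name:10s} {status:8s} needed for {REQUIRED_TOOLS[name]}")
    print("python 3.10.x:", "ok" if python_version_ok() else "different version")
```

The check only confirms that the executables exist on PATH; it does not verify versions or that the Docker Compose services are actually running.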

Highlighted Details

  • Visual-first RAG without OCR, preserving layout and visual content.
  • Modern frontend (Next.js, TypeScript, TailwindCSS) and async backend (FastAPI, Redis, MySQL, MongoDB, MinIO).
  • Uses the ColPali project's ColQwen2.5 model for visual embeddings, which are stored in Milvus.
  • Supports Qwen2.5-VL, with planned support for GPT-4o, Claude, and Gemini.
  • Currently supports PDF documents; future releases will include Word, PPT, Excel, and images.
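A visual-first pipeline over PDFs needs each page as an image before it can be embedded. The summary does not say how LAYRA rasterizes pages; the sketch below assumes a direct call to `pdftoppm` from poppler-utils (a listed dependency), using its `-png` and `-r` options. The function names, the `page` output prefix, and the 150 DPI default are hypothetical choices for illustration.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def build_pdftoppm_command(pdf_path: PathLike, output_prefix: PathLike, dpi: int = 150) -> List[str]:
    """Build the argument list for: pdftoppm -png -r <dpi> <pdf> <prefix>."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(output_prefix)]


def rasterize_pdf(pdf_path: PathLike, out_dir: PathLike, dpi: int = 150) -> List[Path]:
    """Render every page of the PDF to a PNG and return the image paths in page order.

    Requires the pdftoppm executable on PATH; raises CalledProcessError on failure.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    prefix = out / "page"
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm zero-pads page numbers, so a lexicographic sort keeps page order.
    return sorted(out.glob("page-*.png"))
```

Each returned image would then be passed to the embedding model server, one page per request or in batches.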

Maintenance & Community

  • Project is under active development with a first trial version available.
  • Contact: liweiphys (email: liweixmu@foxmail.com), GitHub: github.com/liweiphys/layra.
  • Roadmap available.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and closed-source linking.

Limitations & Caveats

  • Currently in active development and supports only PDF documents.
  • Requires significant setup with multiple dependencies and model downloads.
  • Broader document format support and additional LLM integrations are planned but not yet released.

Health Check

  • Last Commit: 4 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 15 stars in the last 30 days
