layra by liweiphys

Visual RAG system for enterprise document understanding

Created 9 months ago

893 stars

Top 40.5% on SourcePulse

Project Summary

LAYRA is a visual-first Retrieval-Augmented Generation (RAG) system designed to understand documents holistically, preserving layout, semantics, and graphical elements. It targets researchers and enterprises that need to connect unstructured document understanding with multimodal AI, and it positions itself as an alternative to traditional OCR-based RAG.

How It Works

LAYRA processes documents using pure visual embeddings, treating each page as a visual artifact rather than a sequence of tokens. This approach, powered by the ColPali project and its ColQwen2.5 model, captures layout structure, tabular integrity, and embedded visuals such as plots and diagrams. The visual embeddings are stored in Milvus for efficient retrieval, enabling layout-aware question answering. The system uses an async-first FastAPI backend and supports multimodal LLMs such as Qwen2.5-VL, with support for GPT-4o, Claude, and Gemini planned.
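The summary above does not say how page embeddings are compared at query time. ColPali-family models produce several vectors per page and are typically scored with late interaction (MaxSim), so the following is a minimal pure-Python sketch of that scoring under that assumption. The function names, page IDs, and toy 2-D vectors are hypothetical, and nothing here reflects how LAYRA actually queries Milvus.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query vector, take the similarity of
    its best-matching page vector, then sum those maxima."""
    if not page_vecs:
        return 0.0
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: Sequence[Vector], pages: Dict[str, Sequence[Vector]]
) -> List[Tuple[str, float]]:
    """Return (page_id, score) pairs sorted from best to worst match."""
    scored = [(page_id, maxsim_score(query_vecs, vecs)) for page_id, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 2-D embeddings, made up purely for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page-1": [[0.9, 0.1], [0.2, 0.8]],
    "page-2": [[0.1, 0.1], [0.3, 0.2]],
}
ranking = rank_pages(query, pages)
```

In this toy data, page-1 ranks first because each query vector finds a close match among its patch vectors. A real deployment would use high-dimensional model outputs and would normally delegate the candidate search to the vector database rather than scoring every page in Python.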

Quick Start & Requirements

Install/Run: Clone the repository and set up the environment files (.env, .env.local, gunicorn_config.py). Launch dependencies via Docker Compose (milvus-standalone-docker-compose.yml, docker-compose.yml). Install Python 3.10.6, the system dependencies (poppler-utils), and the Python dependencies (pip install -r requirements.txt). Download the ColQwen2.5 model weights and initialize MySQL. Start the backend (gunicorn), then start the embedding model server (python model_server.py). Frontend development requires npm install and npm run dev (or build/start).
Prerequisites: Python 3.10.6, Git LFS, Milvus, Redis, MongoDB, Kafka, MinIO (via Docker Compose), poppler-utils.
Setup Time: Significant setup involving cloning, environment configuration, Docker Compose setup, model downloads, and database initialization.
Links: GitHub Repo (github.com/liweiphys/layra)
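Because the setup touches many tools, a quick preflight check can save time before starting. The sketch below uses only the Python standard library to test whether a few executables are on the PATH and whether the interpreter matches the listed Python 3.10 series. The mapping from prerequisites to executable names (for example, pdftoppm for poppler-utils) is an assumption for illustration and is not taken from the LAYRA repository.

```python
import shutil
import sys
from typing import Callable, Dict, Mapping, Optional, Sequence

# Executable name -> prerequisite it is assumed to stand for (illustrative only).
REQUIRED_TOOLS: Dict[str, str] = {
    "git": "Git (clone the repository)",
    "git-lfs": "Git LFS",
    "docker": "Docker (Compose services such as Milvus, Redis, MongoDB, Kafka, MinIO)",
    "pdftoppm": "poppler-utils",
    "npm": "npm (frontend development)",
}


def check_python(version_info: Sequence[int] = sys.version_info,
                 expected: Sequence[int] = (3, 10)) -> bool:
    """True if the major.minor version matches the expected series."""
    return tuple(version_info[:2]) == tuple(expected)


def check_tools(tools: Mapping[str, str] = REQUIRED_TOOLS,
                which: Callable[[str], Optional[str]] = shutil.which) -> Dict[str, bool]:
    """Map each executable name to whether it was found on the PATH."""
    return {name: which(name) is not None for name in tools}


def report() -> str:
    """Build a human-readable summary of the preflight results."""
    lines = [f"Python 3.10.x: {'ok' if check_python() else 'MISSING'}"]
    for name, found in check_tools().items():
        lines.append(f"{REQUIRED_TOOLS[name]}: {'ok' if found else 'MISSING'}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

The check only confirms that the executables exist. It does not verify that the Docker Compose services are running or that the model weights have been downloaded.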

Highlighted Details

Visual-first RAG without OCR, preserving layout and visual content.
Modern frontend (Next.js, TypeScript, TailwindCSS) and async backend (FastAPI, Redis, MySQL, MongoDB, MinIO).
Uses the ColPali project with ColQwen2.5 for visual embeddings stored in Milvus.
Supports Qwen2.5-VL, with planned support for GPT-4o, Claude, and Gemini.
Currently supports PDF documents; future releases will include Word, PPT, Excel, and images.
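Treating each PDF page as an image implies a rasterization step before embedding. The sketch below shows one way to do it, assuming the pdftoppm command from poppler-utils (listed among the prerequisites) is the renderer. The summary does not confirm how LAYRA itself converts pages, and the function names, output prefix, and default DPI are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def build_pdftoppm_command(pdf_path: PathLike, out_prefix: PathLike, dpi: int = 150) -> List[str]:
    """Build a pdftoppm call that writes one PNG per page as <out_prefix>-<n>.png."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def render_pages(pdf_path: PathLike, out_dir: PathLike, dpi: int = 150) -> List[Path]:
    """Rasterize every page of a PDF and return the image paths in page order.

    Requires pdftoppm on the PATH; raises CalledProcessError if rendering fails.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    # pdftoppm zero-pads page numbers, so a lexicographic sort keeps page order.
    return sorted(out_dir.glob("page-*.png"))
```

Each returned image could then be passed to the embedding model. A higher DPI preserves small table text at the cost of larger images.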

Maintenance & Community

Project is under active development with a first trial version available.
Contact: liweiphys (email: liweixmu@foxmail.com), GitHub: github.com/liweiphys/layra.
Roadmap available.

Licensing & Compatibility

Licensed under the Apache License 2.0.
Permissive license suitable for commercial use and closed-source linking.

Limitations & Caveats

Currently in active development and supports only PDF documents.
Requires significant setup with multiple dependencies and model downloads.
Future releases are planned for broader document format support and additional LLM integrations.

Health Check

Last Commit: 2 months ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0

Star History

5 stars in the last 30 days

Explore Similar Projects

tiny-rag by wdndev

Tiny RAG system for retrieval-augmented LLM

Created 1 year ago

Updated 8 months ago

VARAG by adithya-s-k

Vision-first RAG engine for multimodal document understanding

Created 1 year ago

Updated 5 months ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Philipp Schmid (DevRel at Google DeepMind).

llm-search by snexus

Advanced RAG system for local document interaction

Created 2 years ago

Updated 5 months ago

Starred by Omar Sanseviero (DevRel at Google DeepMind).

Ovis by AIDC-AI

MLLM architecture aligning visual/textual embeddings

Created 1 year ago

Updated 3 months ago

localGPT-Vision by PromtEngineer

Vision-language RAG pipeline for document Q&A

Created 1 year ago

Updated 5 months ago

genai-quickstart-pocs by aws-samples

GenAI PoCs using Amazon Bedrock, SDKs, and Streamlit/Blazor frontends

Created 1 year ago

Updated 2 days ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Didier Lopes (Founder of OpenBB), and 3 more.

colpali by illuin-tech

Vision-language model code for document retrieval research

Created 1 year ago

Updated 3 days ago

langchain4j-aideepin by moyangzhan

AI productivity tools for chat, drawing, RAG, and workflows

Created 2 years ago

Updated 3 days ago

rag_api by danny-avila

Async API for ID-based RAG using Langchain

Created 1 year ago

Updated 1 week ago

LangChain-ChatGLM-Webui by X-D-Lab

WebUI for local knowledge-based Q&A using LangChain and ChatGLM

Created 2 years ago

Updated 1 year ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

WeKnora by Tencent

LLM framework for deep document understanding and RAG

Created 5 months ago

Updated 2 days ago

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

MoE vision-language model for multimodal understanding

Created 1 year ago

Updated 10 months ago
