Vision RAG demo using serverless Modal + FastAPI + React
This project demonstrates a Vision Retrieval Augmented Generation (V-RAG) architecture that bypasses traditional text chunking by using a Vision Language Model (VLM) to embed entire PDF pages as vectors. It's designed for developers and researchers exploring novel RAG techniques, offering a serverless, API-driven approach to document understanding.
How It Works
PDF pages are converted to images and then embedded using a VLM (ColPali in the demo). These embeddings are stored in Qdrant. User queries are also embedded and used to retrieve relevant image embeddings from Qdrant. The original query and the retrieved page images are then fed to a multimodal model (GPT-4o/GPT-4o-mini) to generate a contextually relevant response. This method aims to preserve visual context lost in text-based chunking.
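The sketch below illustrates this flow under simplifying assumptions: it treats the page embeddings as single vectors (ColPali actually emits multi-vector embeddings, so pooling here is a simplification), and `embed_page_image()` / `embed_query()` are hypothetical placeholders for the real VLM embedder rather than code from the repo.

```python
# Illustrative V-RAG flow: index page images in Qdrant, retrieve by query,
# then answer with a multimodal model over the retrieved pages.
import base64, io

from pdf2image import convert_from_path
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from openai import OpenAI

DIM = 128  # embedding size is model-dependent; 128 is illustrative


def embed_page_image(image) -> list[float]:
    """Placeholder: run a page image through the VLM and pool to one vector."""
    raise NotImplementedError


def embed_query(text: str) -> list[float]:
    """Placeholder: embed the user query with the same VLM."""
    raise NotImplementedError


# 1. Convert PDF pages to images and index them (in-memory Qdrant, as in the demo).
qdrant = QdrantClient(":memory:")
qdrant.create_collection(
    collection_name="pages",
    vectors_config=VectorParams(size=DIM, distance=Distance.COSINE),
)
pages = convert_from_path("document.pdf")
qdrant.upsert(
    collection_name="pages",
    points=[
        PointStruct(id=i, vector=embed_page_image(img), payload={"page": i})
        for i, img in enumerate(pages)
    ],
)

# 2. Embed the query and retrieve the most similar page images.
query = "What does the revenue chart on page 3 show?"
hits = qdrant.search(collection_name="pages", query_vector=embed_query(query), limit=3)


# 3. Send the query plus the retrieved page images to a multimodal model.
def to_data_url(image) -> str:
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


content = [{"type": "text", "text": query}] + [
    {"type": "image_url", "image_url": {"url": to_data_url(pages[h.payload["page"]])}}
    for h in hits
]
answer = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": content}],
)
print(answer.choices[0].message.content)
```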
Quick Start & Requirements
- Backend: install Modal (`pip install modal`), then run `modal setup` and `modal serve main.py`.
- Credentials: Hugging Face authentication (`transformers-cli login`) and an OpenAI API key.
- Frontend: `npm install`, then `npm run dev`.
- Interactive API docs are available at the `/docs` endpoint after serving; see the sketch after this list.
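As a rough orientation, `modal serve main.py` deploys a FastAPI app wrapped as a Modal ASGI function, which is where the `/docs` endpoint comes from. The outline below is a minimal sketch, not the repo's actual `main.py`; the app name and route are illustrative.

```python
# Minimal Modal + FastAPI skeleton; names here are placeholders.
import modal
from fastapi import FastAPI

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App("vision-rag-demo", image=image)

web_app = FastAPI()


@web_app.get("/health")
def health() -> dict:
    return {"status": "ok"}


@app.function()
@modal.asgi_app()
def serve():
    # `modal serve main.py` hot-reloads this ASGI app and prints a temporary URL;
    # FastAPI's interactive docs are then reachable at <that URL>/docs.
    return web_app
```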
Highlighted Details
Maintenance & Community
No specific details on contributors, sponsorships, or community channels are provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
This project is explicitly a demo and may not be production-ready. The current implementation uses an in-memory vector database, so indexed embeddings are not persisted across restarts. Performance depends on the Modal GPU configuration.
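If persistence were needed, the in-memory Qdrant client could in principle be swapped for on-disk or hosted storage; the path, URL, and key below are placeholders, not values from the repo.

```python
# In-memory (as in the demo) vs. persistent Qdrant clients; values are illustrative.
from qdrant_client import QdrantClient

ephemeral = QdrantClient(":memory:")             # demo behavior: lost on restart
persistent = QdrantClient(path="./qdrant_data")  # local on-disk storage
# hosted = QdrantClient(url="https://example.qdrant.io", api_key="...")  # managed instance
```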