vision-is-all-you-need by Softlandia-Ltd

Vision RAG demo using serverless Modal + FastAPI + React

Created 1 year ago
392 stars

Top 73.4% on SourcePulse

View on GitHub
Project Summary

This project demonstrates a Vision Retrieval Augmented Generation (V-RAG) architecture that bypasses traditional text chunking by using a Vision Language Model (VLM) to embed entire PDF pages as vectors. It's designed for developers and researchers exploring novel RAG techniques, offering a serverless, API-driven approach to document understanding.

How It Works

PDF pages are converted to images and embedded with a VLM (ColPali in the demo); the resulting embeddings are stored in Qdrant. At query time, the user's question is embedded and used to retrieve the most similar page images from Qdrant. The original query and the retrieved page images are then fed to a multimodal model (GPT-4o/GPT-4o-mini), which generates a contextually grounded response. This approach aims to preserve the visual context that text-based chunking discards.
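
To make the flow concrete, here is a minimal sketch of the three stages (index, retrieve, generate). It is not the repository's code: embed_page and embed_query are hypothetical stand-ins for ColPali inference (ColPali actually emits multi-vector embeddings scored by late interaction; the sketch simplifies to one vector per page), the file name and vector size are illustrative, and only the pdf2image, qdrant-client, and openai calls use those libraries' real APIs.

  import base64, io

  from openai import OpenAI
  from pdf2image import convert_from_path
  from qdrant_client import QdrantClient
  from qdrant_client.models import Distance, PointStruct, VectorParams

  def embed_page(img):    # hypothetical: run ColPali on a page image
      ...

  def embed_query(text):  # hypothetical: run ColPali on the query text
      ...

  # 1) Index: render each PDF page to an image and store its embedding.
  qdrant = QdrantClient(":memory:")  # in-memory, as in the demo
  qdrant.create_collection(
      "pages", vectors_config=VectorParams(size=128, distance=Distance.COSINE)
  )
  pages = convert_from_path("document.pdf", dpi=150)
  qdrant.upsert("pages", points=[
      PointStruct(id=i, vector=embed_page(p), payload={"page": i})
      for i, p in enumerate(pages)
  ])

  # 2) Retrieve: embed the user query and find the closest page images.
  question = "What trend does the chart on the pricing page show?"
  hits = qdrant.search("pages", query_vector=embed_query(question), limit=3)

  # 3) Generate: send the query plus retrieved page images to GPT-4o-mini.
  def data_url(img):
      buf = io.BytesIO()
      img.save(buf, format="PNG")
      return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

  content = [{"type": "text", "text": question}] + [
      {"type": "image_url", "image_url": {"url": data_url(pages[h.payload["page"]])}}
      for h in hits
  ]
  reply = OpenAI().chat.completions.create(
      model="gpt-4o-mini", messages=[{"role": "user", "content": content}]
  )
  print(reply.choices[0].message.content)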

Quick Start & Requirements

  • Install: pip install modal, run modal setup once to authenticate, then modal serve main.py.
  • Prerequisites: Python 3.11+, Hugging Face account (transformers-cli login), OpenAI API key.
  • Deployment: Uses Modal for serverless GPU (A10G) execution; see the sketch after this list.
  • Frontend: Requires Node.js for local development (npm install, npm run dev).
  • Docs: Interactive API documentation is served at the /docs endpoint once the app is running.
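
For a rough picture of what modal serve main.py spins up, the sketch below shows Modal's standard pattern for serving a FastAPI app on an A10G GPU. It is illustrative only: the app name and the /query route are hypothetical, not taken from the repository's main.py.

  import modal

  image = modal.Image.debian_slim().pip_install("fastapi[standard]")
  app = modal.App("vision-rag-demo", image=image)

  @app.function(gpu="A10G")   # serverless GPU, spun up on demand
  @modal.asgi_app()           # expose the returned FastAPI app over HTTP
  def web():
      from fastapi import FastAPI

      api = FastAPI()         # FastAPI auto-serves interactive docs at /docs

      @api.post("/query")     # hypothetical route, for illustration only
      def query(q: str):
          # The real app would embed q, search Qdrant, and call GPT-4o here.
          return {"answer": "..."}

      return api

While modal serve is running, Modal builds the container image, attaches the GPU, and exposes the endpoint at a temporary URL, which is where the /docs page becomes reachable.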

Highlighted Details

  • Serverless deployment via Modal.
  • Utilizes ColPali for PDF page embedding.
  • Leverages GPT-4o/GPT-4o-mini for multimodal understanding.
  • Integrates Qdrant as the vector database.
  • Offers both an API and a local frontend for interaction.

Maintenance & Community

No specific details on contributors, sponsorships, or community channels are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is explicitly labeled a demo and may not be production-ready. The current implementation uses an in-memory vector database, so indexed pages do not persist between runs, and performance depends on the Modal GPU configuration.
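
If persistence is needed, qdrant-client can be pointed at an on-disk store or a standalone Qdrant server with a one-line change; the constructor options below are the library's real ones:

  from qdrant_client import QdrantClient

  client = QdrantClient(":memory:")                    # demo default: lost on restart
  client = QdrantClient(path="./qdrant_data")          # embedded, persisted on disk
  client = QdrantClient(url="http://localhost:6333")   # standalone Qdrant server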

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

gill by kohjingyu

  Multimodal LLM for generating/retrieving images and generating text
  463 stars · Created 2 years ago · Updated 1 year ago
  Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

DeepSeek-VL2 by deepseek-ai

  MoE vision-language model for multimodal understanding
  5k stars · Created 9 months ago · Updated 6 months ago
  Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).