vision-is-all-you-need  by Softlandia-Ltd

Vision RAG demo using serverless Modal + FastAPI + React

created 10 months ago
387 stars

Top 75.2% on sourcepulse

GitHubView on GitHub
1 Expert Loves This Project
Project Summary

This project demonstrates a Vision Retrieval Augmented Generation (V-RAG) architecture that bypasses traditional text chunking by using a Vision Language Model (VLM) to embed entire PDF pages as vectors. It's designed for developers and researchers exploring novel RAG techniques, offering a serverless, API-driven approach to document understanding.

How It Works

PDF pages are converted to images and then embedded using a VLM (ColPali in the demo). These embeddings are stored in Qdrant. User queries are also embedded and used to retrieve relevant image embeddings from Qdrant. The original query and the retrieved page images are then fed to a multimodal model (GPT-4o/GPT-4o-mini) to generate a contextually relevant response. This method aims to preserve visual context lost in text-based chunking.

Quick Start & Requirements

  • Install: pip install modal then modal setup and modal serve main.py.
  • Prerequisites: Python 3.11+, Hugging Face account (transformers-cli login), OpenAI API key.
  • Deployment: Uses Modal for serverless GPU (A10G) execution.
  • Frontend: Requires Node.js for local development (npm install, npm run dev).
  • Docs: API documentation available at /docs endpoint after serving.

Highlighted Details

  • Serverless deployment via Modal.
  • Utilizes ColPali for PDF page embedding.
  • Leverages GPT-4o/GPT-4o-mini for multimodal understanding.
  • Integrates Qdrant as the vector database.
  • Offers both API and a local frontend for interaction.

Maintenance & Community

No specific details on contributors, sponsorships, or community channels are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The demo is explicitly labeled as a demo and may not be production-ready. The current implementation uses an in-memory vector database, which is not persistent. The performance is dependent on the Modal GPU configuration.

Health Check
Last commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
26 stars in the last 90 days

Explore Similar Projects

Feedback? Help us improve.