vision-is-all-you-need by Softlandia-Ltd

Vision RAG demo using serverless Modal + FastAPI + React

Created 1 year ago
392 stars

Top 73.4% on SourcePulse

View on GitHub
Project Summary

This project demonstrates a Vision Retrieval Augmented Generation (V-RAG) architecture that bypasses traditional text chunking by using a Vision Language Model (VLM) to embed entire PDF pages as vectors. It's designed for developers and researchers exploring novel RAG techniques, offering a serverless, API-driven approach to document understanding.

How It Works

PDF pages are converted to images and embedded with a VLM (ColPali in the demo); the resulting embeddings are stored in Qdrant. At query time, the user's question is embedded and used to retrieve the most similar page images from Qdrant. The original query and the retrieved page images are then fed to a multimodal model (GPT-4o/GPT-4o-mini), which generates a contextually grounded response. This approach aims to preserve the visual context that text-based chunking discards.
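
To make the flow concrete, here is a minimal sketch of the three stages (index, retrieve, generate). It is not the repository's code: embed_page and embed_query are hypothetical stand-ins for ColPali inference (ColPali actually emits multi-vector embeddings scored by late interaction; the sketch simplifies to one vector per page), the file name and vector size are illustrative, and only the pdf2image, qdrant-client, and openai calls use those libraries' real APIs.

  import base64, io

  from openai import OpenAI
  from pdf2image import convert_from_path
  from qdrant_client import QdrantClient
  from qdrant_client.models import Distance, PointStruct, VectorParams

  def embed_page(img):    # hypothetical: run ColPali on a page image
      ...

  def embed_query(text):  # hypothetical: run ColPali on the query text
      ...

  # 1) Index: render each PDF page to an image and store its embedding.
  qdrant = QdrantClient(":memory:")  # in-memory, as in the demo
  qdrant.create_collection(
      "pages", vectors_config=VectorParams(size=128, distance=Distance.COSINE)
  )
  pages = convert_from_path("document.pdf", dpi=150)
  qdrant.upsert("pages", points=[
      PointStruct(id=i, vector=embed_page(p), payload={"page": i})
      for i, p in enumerate(pages)
  ])

  # 2) Retrieve: embed the user query and find the closest page images.
  question = "What trend does the chart on the pricing page show?"
  hits = qdrant.search("pages", query_vector=embed_query(question), limit=3)

  # 3) Generate: send the query plus retrieved page images to GPT-4o-mini.
  def data_url(img):
      buf = io.BytesIO()
      img.save(buf, format="PNG")
      return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

  content = [{"type": "text", "text": question}] + [
      {"type": "image_url", "image_url": {"url": data_url(pages[h.payload["page"]])}}
      for h in hits
  ]
  reply = OpenAI().chat.completions.create(
      model="gpt-4o-mini", messages=[{"role": "user", "content": content}]
  )
  print(reply.choices[0].message.content)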

Quick Start & Requirements

  • Install: pip install modal, run modal setup once to authenticate, then modal serve main.py.
  • Prerequisites: Python 3.11+, Hugging Face account (transformers-cli login), OpenAI API key.
  • Deployment: Uses Modal for serverless GPU (A10G) execution; see the sketch after this list.
  • Frontend: Requires Node.js for local development (npm install, npm run dev).
  • Docs: Interactive API documentation is served at the /docs endpoint once the app is running.
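
For a rough picture of what modal serve main.py spins up, the sketch below shows Modal's standard pattern for serving a FastAPI app on an A10G GPU. It is illustrative only: the app name and the /query route are hypothetical, not taken from the repository's main.py.

  import modal

  image = modal.Image.debian_slim().pip_install("fastapi[standard]")
  app = modal.App("vision-rag-demo", image=image)

  @app.function(gpu="A10G")   # serverless GPU, spun up on demand
  @modal.asgi_app()           # expose the returned FastAPI app over HTTP
  def web():
      from fastapi import FastAPI

      api = FastAPI()         # FastAPI auto-serves interactive docs at /docs

      @api.post("/query")     # hypothetical route, for illustration only
      def query(q: str):
          # The real app would embed q, search Qdrant, and call GPT-4o here.
          return {"answer": "..."}

      return api

While modal serve is running, Modal builds the container image, attaches the GPU, and exposes the endpoint at a temporary URL, which is where the /docs page becomes reachable.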

Highlighted Details

  • Serverless deployment via Modal.
  • Utilizes ColPali for PDF page embedding.
  • Leverages GPT-4o/GPT-4o-mini for multimodal understanding.
  • Integrates Qdrant as the vector database.
  • Offers both an API and a local frontend for interaction.

Maintenance & Community

No specific details on contributors, sponsorships, or community channels are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is explicitly labeled a demo and may not be production-ready. The current implementation uses an in-memory vector database, so indexed pages do not persist between runs, and performance depends on the Modal GPU configuration.
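
If persistence is needed, qdrant-client can be pointed at an on-disk store or a standalone Qdrant server with a one-line change; the constructor options below are the library's real ones:

  from qdrant_client import QdrantClient

  client = QdrantClient(":memory:")                    # demo default: lost on restart
  client = QdrantClient(path="./qdrant_data")          # embedded, persisted on disk
  client = QdrantClient(url="http://localhost:6333")   # standalone Qdrant server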

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days

Explore Similar Projects

gill by kohjingyu

  Multimodal LLM for generating/retrieving images and generating text
  463 stars · Created 2 years ago · Updated 1 year ago
  Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

DeepSeek-VL2 by deepseek-ai

  MoE vision-language model for multimodal understanding
  5k stars · Created 9 months ago · Updated 6 months ago
  Starred by Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (founder of DAIR.AI).