VARAG by adithya-s-k

Vision-first RAG engine for multimodal document understanding

Created 1 year ago
481 stars

Top 63.7% on SourcePulse

View on GitHub
Project Summary

VARAG is a vision-first Retrieval-Augmented Generation (RAG) engine for combining visual and textual data in information retrieval. It offers a flexible abstraction layer for experimenting with various RAG techniques, including text, image, and multimodal document retrieval, which simplifies evaluating different approaches across diverse use cases.

How It Works

VARAG integrates vision-language models to embed both visual and textual data into a shared vector space, enabling cross-modal similarity search. It supports four retrieval methods:

  • Simple RAG: OCR-extracted text for text-heavy documents.
  • Vision RAG: cross-modal embeddings that correlate text and images.
  • ColPali RAG: document pages embedded as images for visually aware retrieval.
  • Hybrid ColPali RAG: image embeddings combined with ColPali's late interaction for re-ranking.

This modular design, inspired by Byaldi, uses LanceDB for vector storage, facilitating rapid experimentation.
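As an illustration of the shared-vector-space idea, here is a minimal sketch that indexes an image and a text snippet in LanceDB and retrieves across modalities. This is not VARAG's actual API: embed_text and embed_image are hypothetical stand-ins for a real CLIP-style encoder, and the table schema is assumed.

```python
# Hedged sketch of cross-modal retrieval with LanceDB, not VARAG's real API.
# embed_text / embed_image are placeholder encoders; a real system would use a
# CLIP-style vision-language model that maps both modalities into one space.
import lancedb
import numpy as np

DIM = 512  # embedding width; depends on the chosen vision-language model

def embed_text(text: str) -> list[float]:
    # Placeholder: deterministic pseudo-embedding instead of a text encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(DIM).tolist()

def embed_image(path: str) -> list[float]:
    # Placeholder: deterministic pseudo-embedding instead of an image encoder.
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    return rng.standard_normal(DIM).tolist()

db = lancedb.connect("./varag-demo-db")
table = db.create_table(
    "pages",
    data=[
        {"vector": embed_image("report_page_1.png"), "source": "report_page_1.png"},
        {"vector": embed_text("Quarterly revenue grew 12%."), "source": "notes.txt"},
    ],
)

# With a real shared encoder, a text query retrieves image pages (and vice
# versa) because both modalities live in the same vector space.
hits = table.search(embed_text("revenue growth chart")).limit(2).to_list()
for hit in hits:
    print(hit["source"], hit["_distance"])
```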

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment (conda create -n varag-venv python=3.10), activate it (conda activate varag-venv), and install dependencies (pip install -e . or poetry install). OCR dependencies can be installed with pip install -e ".[ocr]" (quoted so the shell does not expand the brackets). A quick import check follows this list.
  • Demo: Run python demo.py --share for an interactive playground.
  • Prerequisites: Python 3.10, Conda.
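A hedged post-install sanity check (assuming the package installs under the top-level name varag and the OCR extra pulls in docling; adjust the names if the project packages them differently):

```python
# Verify that the editable install (and optional OCR extra) are importable.
# Module names are assumptions; see the note above.
import importlib.util

for module in ("varag", "docling"):
    found = importlib.util.find_spec(module) is not None
    print(f"{module}: {'ok' if found else 'missing'}")
```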

Highlighted Details

  • Supports text, image, and multimodal retrieval.
  • Implements ColPali RAG, leveraging PaliGemma for image-based document page embedding and late interaction (a scoring sketch follows this list).
  • Integrates OCR via Docling for scanned documents.
  • Uses LanceDB as the vector store for ease of use and customizability.
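For intuition about the late-interaction step mentioned above, the sketch below implements generic ColPali/ColBERT-style MaxSim scoring: each query-token embedding is matched against its most similar document-patch embedding, and the per-token maxima are summed. This illustrates the technique in general, not VARAG's specific implementation.

```python
# Generic MaxSim late-interaction scoring (ColPali/ColBERT style), shown with
# random unit vectors. Real inputs would be query-token and page-patch
# embeddings from a model such as PaliGemma/ColPali.
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """query_tokens: (Q, D), doc_patches: (P, D); rows L2-normalized."""
    sims = query_tokens @ doc_patches.T   # (Q, P) cosine similarities
    return float(sims.max(axis=1).sum())  # best patch per query token, summed

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)   # 8 query tokens
d = rng.standard_normal((64, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)   # 64 page patches
print(maxsim_score(q, d))
```

Because scoring compares token-level embeddings rather than a single pooled vector, late interaction can re-rank candidates with finer-grained matching, which is what the hybrid method exploits.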

Maintenance & Community

The project is open for contributions. Contact is available via email at adithyaskolavi@gmail.com.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and modification.

Limitations & Caveats

The project is presented as an experimental framework for evaluating RAG techniques rather than a production-ready library. Specific performance benchmarks or detailed comparisons between the implemented techniques are not explicitly provided in the README.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago
Updated 6 months ago