Vision-first RAG engine for multimodal document understanding
VARAG is a vision-first Retrieval-Augmented Generation (RAG) engine for integrating visual and textual data in information retrieval. It offers a flexible abstraction layer for experimenting with RAG techniques, including text, image, and multimodal document retrieval, simplifying the evaluation of different approaches across use cases.
How It Works
VARAG integrates vision-language models to embed both visual and textual data into a shared vector space, enabling cross-modal similarity search. It supports four retrieval methods:
- Simple RAG with OCR, for text-heavy documents
- Vision RAG, which uses cross-modal embeddings to correlate text and images
- ColPali RAG, which embeds document pages as images for visually aware retrieval
- Hybrid ColPali RAG, which combines image embeddings with ColPali's late-interaction re-ranking (both patterns are sketched below)
This modular design, inspired by Byaldi, uses LanceDB for vector storage, facilitating rapid experimentation.
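The shared-embedding idea is easy to illustrate without VARAG's own API. The following is a minimal sketch, not VARAG's implementation: it assumes the off-the-shelf clip-ViT-B-32 model from sentence-transformers and a local LanceDB table, with hypothetical page-image file names.

```python
# Sketch of cross-modal retrieval: page images and text queries share one
# CLIP embedding space, stored and searched in LanceDB. Illustrative only;
# this shows the pattern VARAG builds on, not its actual API.
import lancedb
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # encodes both text and images

# Embed document pages rendered as images (hypothetical file names).
pages = ["page_001.png", "page_002.png"]
records = [
    {"vector": model.encode(Image.open(p)).tolist(), "source": p}
    for p in pages
]

db = lancedb.connect("./lancedb")               # local vector store
table = db.create_table("pages", data=records)

# A text query embedded with the same model retrieves the stored page images.
query_vec = model.encode("quarterly revenue chart")
for hit in table.search(query_vec.tolist()).limit(3).to_list():
    print(hit["source"])
```

ColPali's late-interaction re-ranking can be sketched just as briefly: every query-token embedding is compared against every page-patch embedding, and the best match per token is summed into a MaxSim score. The dimensions below are illustrative, not ColPali's actual configuration.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query token, take its best
    cosine similarity among the page's patch embeddings, then sum over tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    p = page_patches / np.linalg.norm(page_patches, axis=1, keepdims=True)
    sims = q @ p.T  # (n_query_tokens, n_page_patches) cosine matrix
    return float(sims.max(axis=1).sum())

# Illustrative shapes: 8 query tokens, 196 page patches, 128-dim embeddings.
rng = np.random.default_rng(0)
print(maxsim_score(rng.normal(size=(8, 128)), rng.normal(size=(196, 128))))
```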
Quick Start & Requirements
Create a conda environment (conda create -n varag-venv python=3.10), activate it (conda activate varag-venv), and install dependencies (pip install -e . or poetry install). OCR dependencies can be installed with pip install -e .["ocr"]. Run python demo.py --share for an interactive playground.
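Collected as a single shell session (assuming you have cloned the repository and are in its root directory):

```bash
# Create and activate an isolated Python 3.10 environment
conda create -n varag-venv python=3.10
conda activate varag-venv

# Install the package in editable mode (or: poetry install)
pip install -e .

# Optional OCR extras for text-heavy documents (on zsh, quote as ".[ocr]")
pip install -e .["ocr"]

# Launch the interactive playground with a shareable link
python demo.py --share
```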
Maintenance & Community
The project is open to contributions; the maintainer can be reached by email at adithyaskolavi@gmail.com.
Licensing & Compatibility
Licensed under the MIT License, permitting commercial use and modification.
Limitations & Caveats
The project is presented as an experimental framework for evaluating RAG techniques rather than a production-ready library. The README does not provide performance benchmarks or detailed comparisons between the implemented techniques.