VARAG by adithya-s-k

Vision-first RAG engine for multimodal document understanding

Created 1 year ago
481 stars

Top 63.7% on SourcePulse

View on GitHub
Project Summary

VARAG is a vision-first Retrieval-Augmented Generation (RAG) engine for combining visual and textual data in information retrieval. It offers a flexible abstraction layer for experimenting with various RAG techniques, including text, image, and multimodal document retrieval, which simplifies evaluating different approaches across diverse use cases.

How It Works

VARAG integrates vision-language models to embed both visual and textual data into a shared vector space, enabling cross-modal similarity search. It supports four retrieval methods:

  • Simple RAG: OCR-extracted text for text-heavy documents.
  • Vision RAG: cross-modal embeddings that correlate text and images.
  • ColPali RAG: document pages embedded as images for visually aware retrieval.
  • Hybrid ColPali RAG: image embeddings combined with ColPali's late interaction for re-ranking.

This modular design, inspired by Byaldi, uses LanceDB for vector storage, facilitating rapid experimentation.
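As an illustration of the shared-vector-space idea, here is a minimal sketch that indexes an image and a text snippet in LanceDB and retrieves across modalities. This is not VARAG's actual API: embed_text and embed_image are hypothetical stand-ins for a real CLIP-style encoder, and the table schema is assumed.

```python
# Hedged sketch of cross-modal retrieval with LanceDB, not VARAG's real API.
# embed_text / embed_image are placeholder encoders; a real system would use a
# CLIP-style vision-language model that maps both modalities into one space.
import lancedb
import numpy as np

DIM = 512  # embedding width; depends on the chosen vision-language model

def embed_text(text: str) -> list[float]:
    # Placeholder: deterministic pseudo-embedding instead of a text encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(DIM).tolist()

def embed_image(path: str) -> list[float]:
    # Placeholder: deterministic pseudo-embedding instead of an image encoder.
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    return rng.standard_normal(DIM).tolist()

db = lancedb.connect("./varag-demo-db")
table = db.create_table(
    "pages",
    data=[
        {"vector": embed_image("report_page_1.png"), "source": "report_page_1.png"},
        {"vector": embed_text("Quarterly revenue grew 12%."), "source": "notes.txt"},
    ],
)

# With a real shared encoder, a text query retrieves image pages (and vice
# versa) because both modalities live in the same vector space.
hits = table.search(embed_text("revenue growth chart")).limit(2).to_list()
for hit in hits:
    print(hit["source"], hit["_distance"])
```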

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment (conda create -n varag-venv python=3.10), activate it (conda activate varag-venv), and install dependencies (pip install -e . or poetry install). OCR dependencies can be installed with pip install -e ".[ocr]" (quoted so the shell does not expand the brackets). A quick import check follows this list.
  • Demo: Run python demo.py --share for an interactive playground.
  • Prerequisites: Python 3.10, Conda.
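A hedged post-install sanity check (assuming the package installs under the top-level name varag and the OCR extra pulls in docling; adjust the names if the project packages them differently):

```python
# Verify that the editable install (and optional OCR extra) are importable.
# Module names are assumptions; see the note above.
import importlib.util

for module in ("varag", "docling"):
    found = importlib.util.find_spec(module) is not None
    print(f"{module}: {'ok' if found else 'missing'}")
```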

Highlighted Details

  • Supports text, image, and multimodal retrieval.
  • Implements ColPali RAG, leveraging PaliGemma for image-based document page embedding and late interaction (a scoring sketch follows this list).
  • Integrates OCR via Docling for scanned documents.
  • Uses LanceDB as the vector store for ease of use and customizability.
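For intuition about the late-interaction step mentioned above, the sketch below implements generic ColPali/ColBERT-style MaxSim scoring: each query-token embedding is matched against its most similar document-patch embedding, and the per-token maxima are summed. This illustrates the technique in general, not VARAG's specific implementation.

```python
# Generic MaxSim late-interaction scoring (ColPali/ColBERT style), shown with
# random unit vectors. Real inputs would be query-token and page-patch
# embeddings from a model such as PaliGemma/ColPali.
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """query_tokens: (Q, D), doc_patches: (P, D); rows L2-normalized."""
    sims = query_tokens @ doc_patches.T   # (Q, P) cosine similarities
    return float(sims.max(axis=1).sum())  # best patch per query token, summed

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)   # 8 query tokens
d = rng.standard_normal((64, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)   # 64 page patches
print(maxsim_score(q, d))
```

Because scoring compares token-level embeddings rather than a single pooled vector, late interaction can re-rank candidates with finer-grained matching, which is what the hybrid method exploits.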

Maintenance & Community

The project is open for contributions. Contact is available via email at adithyaskolavi@gmail.com.

Licensing & Compatibility

Licensed under the MIT License, permitting commercial use and modification.

Limitations & Caveats

The project is presented as an experimental framework for evaluating RAG techniques rather than a production-ready library. Specific performance benchmarks or detailed comparisons between the implemented techniques are not explicitly provided in the README.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

5k stars
MoE vision-language model for multimodal understanding
Created 9 months ago
Updated 6 months ago