ViDoRAG by Alibaba-NLP

RAG framework for visual documents, using dynamic iterative reasoning agents

Created 6 months ago
540 stars

Top 58.9% on SourcePulse

Project Summary

ViDoRAG is a Retrieval-Augmented Generation (RAG) framework for complex reasoning over visually rich documents. It targets researchers and developers building document understanding and question-answering systems, and reports improved noise robustness and state-of-the-art results on the ViDoSeek benchmark.

How It Works

ViDoRAG employs a multi-agent, actor-critic paradigm for iterative reasoning, enhancing robustness against noisy inputs. A key innovation is its Gaussian Mixture Model (GMM)-based multi-modal hybrid retrieval strategy, which dynamically integrates visual and textual information pipelines for more effective document retrieval.
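
The summary does not say how the GMM is applied, so the following is only a minimal sketch of one plausible reading: for each modality, fit a one-dimensional, two-component Gaussian mixture to the query-to-page similarity scores, keep the pages that more likely belong to the higher-scoring component (a dynamic top-k), then take the union of the visual and textual picks. The function names (high_component_posteriors, dynamic_top_k, hybrid_retrieve), the 0.5 posterior cutoff, and the sample scores are all hypothetical and are not taken from the ViDoRAG code.

```python
import math
from typing import Dict, List


def _normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def high_component_posteriors(scores: List[float], iters: int = 50,
                              min_var: float = 1e-4) -> List[float]:
    """Fit a two-component 1-D GMM with EM and return, for each score,
    the posterior probability of the higher-mean component."""
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-12:
        return [1.0] * len(scores)  # identical scores: nothing to separate
    means = [lo, hi]  # start the components at the two extremes
    variances = [max(((hi - lo) / 4.0) ** 2, min_var)] * 2
    weights = [0.5, 0.5]
    resp_high = [0.5] * len(scores)
    for _ in range(iters):
        # E-step: responsibility of the "high" component for each score.
        for i, s in enumerate(scores):
            p0 = weights[0] * _normal_pdf(s, means[0], variances[0])
            p1 = weights[1] * _normal_pdf(s, means[1], variances[1])
            total = p0 + p1
            if total > 0.0:
                resp_high[i] = p1 / total
            else:  # both densities underflowed: assign to the nearest mean
                resp_high[i] = 1.0 if abs(s - means[1]) < abs(s - means[0]) else 0.0
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in (0, 1):
            resp = [r if k == 1 else 1.0 - r for r in resp_high]
            n_k = sum(resp)
            if n_k < 1e-9:
                continue  # empty component: keep its previous parameters
            means[k] = sum(r * s for r, s in zip(resp, scores)) / n_k
            var = sum(r * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / n_k
            variances[k] = max(var, min_var)  # floor avoids a collapsed component
            weights[k] = n_k / len(scores)
    if means[1] < means[0]:  # guard against the components swapping roles
        resp_high = [1.0 - r for r in resp_high]
    return resp_high


def dynamic_top_k(scores: Dict[str, float]) -> List[str]:
    """Return page ids assigned to the high-score component, best first."""
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    posteriors = high_component_posteriors([s for _, s in ranked])
    return [page_id for (page_id, _), p in zip(ranked, posteriors) if p >= 0.5]


def hybrid_retrieve(visual: Dict[str, float], textual: Dict[str, float]) -> List[str]:
    """Union of the per-modality picks, visual picks first, without duplicates."""
    merged: List[str] = []
    for page_id in dynamic_top_k(visual) + dynamic_top_k(textual):
        if page_id not in merged:
            merged.append(page_id)
    return merged


if __name__ == "__main__":
    # Made-up similarity scores for illustration only.
    visual_scores = {"p1": 0.91, "p2": 0.88, "p3": 0.35, "p4": 0.30, "p5": 0.28}
    text_scores = {"p2": 0.80, "p6": 0.78, "p3": 0.20, "p7": 0.15}
    print(hybrid_retrieve(visual_scores, text_scores))
```

The point of the sketch is that the number of retrieved pages is not fixed in advance: it follows from where the score distribution separates into a relevant and an irrelevant cluster, and each modality gets its own cutoff before the results are merged.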

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n vidorag python=3.10), and install requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.10 and Git LFS (for the dataset download). Following the dependency guidance of the ColPali-engine and Transformers libraries is recommended for compatible versions.
  • Dataset: The ViDoSeek dataset is available via Git LFS. Scripts are provided to convert PDFs to images and, optionally, to process them with OCR or VLMs.
  • Resources: Building the index database involves multi-modal embedding, which can be resource-intensive.
  • Links: ViDoRAG GitHub, ViDoSeek Dataset

Highlighted Details

  • Introduces the ViDoSeek benchmark for visually rich document retrieval-reason-answer tasks.
  • Reports an improvement of over 10% over existing methods on ViDoSeek, establishing a new state of the art.
  • Supports integration of various embedding models for custom retriever creation.
  • Provides evaluation code for customizing assessment pipelines.

Maintenance & Community

The project is associated with Alibaba-NLP. The README gives no further community or maintenance details.

Licensing & Compatibility

The README does not explicitly state a license. The citation lists the journal as "arXiv preprint arXiv:2502.18017", which suggests the work is a research preprint. Compatibility with commercial or closed-source use is not specified.

Limitations & Caveats

The framework is presented as a research contribution, and its stability, long-term maintenance, and production-readiness are not detailed. Specific hardware requirements for efficient multi-modal embedding are not listed.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 13 stars in the last 30 days
