ViDoRAG by Alibaba-NLP

RAG framework for visual documents, using dynamic iterative reasoning agents

Created 10 months ago
619 stars

Top 53.4% on SourcePulse

Project Summary

ViDoRAG is a novel Retrieval-Augmented Generation (RAG) framework designed for complex reasoning over visually rich documents. It targets researchers and developers working with document understanding and question-answering systems, offering improved noise robustness and state-of-the-art performance on visually complex datasets.

How It Works

ViDoRAG employs a multi-agent, actor-critic paradigm for iterative reasoning, enhancing robustness against noisy inputs. A key innovation is its Gaussian Mixture Model (GMM)-based multi-modal hybrid retrieval strategy, which dynamically integrates visual and textual information pipelines for more effective document retrieval.
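The GMM-based retrieval idea can be illustrated with a small sketch. This is not the repository's code: it assumes each pipeline (textual and visual) returns a similarity score per page, fits a two-component one-dimensional Gaussian mixture to those scores with plain EM, keeps the pages assigned to the higher-scoring component (so the cutoff varies per query instead of being a fixed top-k), and then merges both pipelines. The function names, the 0.5 responsibility threshold, and the max-score merge rule are all hypothetical choices made for illustration.

```python
import math


def high_score_responsibility(scores, iters=50):
    """Fit a 1-D two-component Gaussian mixture with EM.

    Returns, for each score, the posterior probability that it belongs
    to the component initialised at the highest score.
    """
    lo, hi = min(scores), max(scores)
    mu = [lo, hi]                                # means start at the extremes
    spread = max((hi - lo) / 4.0, 1e-3)
    var = [spread ** 2, spread ** 2]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: responsibility of the high-score component per score.
        resp = []
        for s in scores:
            p = [
                weight[k]
                * math.exp(-((s - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in (0, 1)
            ]
            total = p[0] + p[1]
            resp.append(p[1] / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in (0, 1):
            r = [x if k == 1 else 1.0 - x for x in resp]
            n = sum(r)
            if n < 1e-9:
                continue                         # component is empty; leave it
            mu[k] = sum(ri * s for ri, s in zip(r, scores)) / n
            var[k] = max(
                sum(ri * (s - mu[k]) ** 2 for ri, s in zip(r, scores)) / n,
                1e-6,                            # floor avoids a zero variance
            )
            weight[k] = n / len(scores)
    return resp


def dynamic_top_k(scored):
    """scored: dict of page id -> similarity. Keep the high-score cluster."""
    ids = list(scored)
    values = [scored[i] for i in ids]
    if len(ids) < 2 or max(values) == min(values):
        return ids                               # nothing to separate
    resp = high_score_responsibility(values)
    return [i for i, r in zip(ids, resp) if r > 0.5]


def hybrid_retrieve(text_scores, visual_scores):
    """Union of both pipelines' dynamic cutoffs, ranked by best score.

    Assumption: scores from the two pipelines are comparable enough to
    rank by their maximum; a real system may need calibration first.
    """
    kept = set(dynamic_top_k(text_scores)) | set(dynamic_top_k(visual_scores))

    def best(page_id):
        return max(text_scores.get(page_id, 0.0), visual_scores.get(page_id, 0.0))

    return sorted(kept, key=best, reverse=True)
```

Calling `hybrid_retrieve` with one score dictionary per pipeline returns the page ids that survive either cutoff. The number of pages kept depends on how the scores cluster for that query, which is the property the dynamic strategy is meant to provide.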

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n vidorag python=3.10), activate it, and install the requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.10 and Git LFS (for the dataset download). The README recommends following the ColPali-engine and Transformers library guidance to choose compatible dependency versions.
  • Dataset: The ViDoSeek dataset is available via Git LFS. Scripts are provided to convert PDFs to images and, optionally, to extract text from them with OCR or a vision-language model (VLM).
  • Resources: Building the index database involves multi-modal embedding, which can be resource-intensive.
  • Links: ViDoRAG GitHub, ViDoSeek Dataset

Highlighted Details

  • Introduces the ViDoSeek benchmark for visually rich document retrieval-reason-answer tasks.
  • Reports an improvement of over 10% over existing methods on ViDoSeek, which the authors present as a new state of the art.
  • Supports integration of various embedding models for custom retriever creation.
  • Provides evaluation code for customizing assessment pipelines.

Maintenance & Community

The project is associated with Alibaba-NLP. The README gives no further information about community channels or maintenance.

Licensing & Compatibility

The README does not explicitly state a license. Its citation entry gives the venue as "arXiv preprint arXiv:2502.18017", indicating that the project accompanies a research preprint. Compatibility with commercial or closed-source use is not specified.

Limitations & Caveats

The framework is presented as a research contribution, and its stability, long-term maintenance, and production-readiness are not detailed. Specific hardware requirements for efficient multi-modal embedding are not listed.

Health Check

  • Last commit: 22 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 6 stars in the last 30 days
