ViDoRAG by Alibaba-NLP

RAG framework for visual documents, using dynamic iterative reasoning agents

Created 5 months ago · 522 stars · Top 61.2% on sourcepulse

Project Summary

ViDoRAG is a novel Retrieval-Augmented Generation (RAG) framework designed for complex reasoning over visually rich documents. It targets researchers and developers working with document understanding and question-answering systems, offering improved noise robustness and state-of-the-art performance on visually complex datasets.

How It Works

ViDoRAG employs a multi-agent, actor-critic paradigm for iterative reasoning, which improves robustness to noisy inputs. A key innovation is its Gaussian Mixture Model (GMM)-based multi-modal hybrid retrieval strategy, which dynamically combines the visual and textual retrieval pipelines for more effective document retrieval.
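
To make the GMM component concrete, here is a minimal sketch of one plausible reading of such a strategy: fit a two-component mixture to a query's similarity scores and keep only the candidates assigned to the higher-scoring component, so the retrieval cutoff adapts per query instead of using a fixed top-k. The two-component assumption, the function names, and the use of scikit-learn are illustrative choices, not ViDoRAG's actual implementation.

```python
# Illustrative sketch: adaptive retrieval cutoff via a 2-component GMM.
# Assumptions (not from the repo): scores are similarity scores already
# produced by separate visual and textual retrievers, and two components
# (relevant vs. noise) are enough to separate them.
import numpy as np
from sklearn.mixture import GaussianMixture

def adaptive_top_k(scores: np.ndarray) -> np.ndarray:
    """Return indices of candidates assigned to the higher-mean component."""
    gmm = GaussianMixture(n_components=2, random_state=0)
    labels = gmm.fit_predict(scores.reshape(-1, 1))
    relevant = int(np.argmax(gmm.means_.ravel()))  # component with the higher mean score
    return np.flatnonzero(labels == relevant)

# Hybrid retrieval: merge per-modality candidate sets instead of a fixed top-k.
visual_scores = np.array([0.82, 0.79, 0.31, 0.28, 0.75, 0.30])
textual_scores = np.array([0.66, 0.21, 0.64, 0.19, 0.18, 0.63])
candidates = np.union1d(adaptive_top_k(visual_scores), adaptive_top_k(textual_scores))
print(candidates)  # indices kept by either modality
```

Merging the per-modality candidate sets this way lets each query lean on whichever pipeline separates relevant pages from noise more cleanly.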

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n vidorag python=3.10), and install the requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.10 and Git LFS (for the dataset download). The README recommends following the Colpali-engine and Transformers library guidance for compatible dependency versions.
  • Dataset: The ViDoSeek dataset is distributed via Git LFS. Scripts are provided to convert PDFs to page images and optionally apply OCR or VLMs; a sketch of that conversion step follows this list.
  • Resources: Building the index database involves multi-modal embedding, which can be resource-intensive.
  • Links: ViDoRAG GitHub, ViDoSeek Dataset
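
The repository ships its own conversion scripts; as a rough sketch of what the PDF-to-image step involves, the snippet below uses the third-party pdf2image package (an assumption made for illustration, not necessarily what the repo's scripts use).

```python
# Illustrative sketch of the PDF -> page-image preprocessing step.
# Assumption: pdf2image (a wrapper around poppler) stands in for the
# repository's own conversion scripts.
from pathlib import Path
from pdf2image import convert_from_path

def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 144) -> list[str]:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    saved = []
    for i, page in enumerate(pages):
        path = out / f"page_{i:04d}.png"
        page.save(path)
        saved.append(str(path))
    return saved
```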

Highlighted Details

  • Introduces the ViDoSeek benchmark for visually rich document retrieval-reason-answer tasks.
  • Achieves over 10% improvement on ViDoSeek, establishing a new state-of-the-art.
  • Supports integration of various embedding models for custom retriever creation; a sketch of that interface follows this list.
  • Provides evaluation code for customizing assessment pipelines.
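
As an illustration of the pluggable-embedding point above, the sketch below wires an arbitrary embedding callable into a small cosine-similarity retriever. The interface is hypothetical; the repository defines its own retriever classes, and in practice the callable would wrap a real encoder such as one from Colpali-engine.

```python
# Illustrative sketch: a retriever with a pluggable embedding model.
# The CustomRetriever interface and toy_embed are hypothetical stand-ins.
import numpy as np
from typing import Callable, Sequence

class CustomRetriever:
    def __init__(self, embed: Callable[[Sequence[str]], np.ndarray], docs: Sequence[str]):
        self.embed = embed
        self.docs = list(docs)
        emb = embed(self.docs)
        self.index = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        q = self.embed([query])[0]
        q = q / np.linalg.norm(q)
        scores = self.index @ q  # cosine similarity on normalized vectors
        top = np.argsort(scores)[::-1][:k]
        return [(self.docs[i], float(scores[i])) for i in top]

# Toy embedding so the sketch runs end to end; replace with a real model.
def toy_embed(texts: Sequence[str]) -> np.ndarray:
    return np.vstack([[hash((t, d)) % 1000 / 1000.0 for d in range(8)] for t in texts])

retriever = CustomRetriever(toy_embed, ["page about revenue", "page about org chart"])
print(retriever.search("revenue", k=1))
```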

Maintenance & Community

The project is associated with Alibaba-NLP. The README does not provide further community or maintenance details.

Licensing & Compatibility

The README does not state a license. The citation points to an arXiv preprint (arXiv:2502.18017), so the code accompanies a research paper. Suitability for commercial or closed-source use is not specified.

Limitations & Caveats

The framework is presented as a research contribution, and its stability, long-term maintenance, and production-readiness are not detailed. Specific hardware requirements for efficient multi-modal embedding are not listed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 67 stars in the last 90 days
