ViDoRAG by Alibaba-NLP

RAG framework for visual documents, using dynamic iterative reasoning agents

Created 5 months ago · 522 stars · Top 61.2% on sourcepulse

Project Summary

ViDoRAG is a novel Retrieval-Augmented Generation (RAG) framework designed for complex reasoning over visually rich documents. It targets researchers and developers working with document understanding and question-answering systems, offering improved noise robustness and state-of-the-art performance on visually complex datasets.

How It Works

ViDoRAG employs a multi-agent, actor-critic paradigm for iterative reasoning, which improves robustness to noisy inputs. A key innovation is its Gaussian Mixture Model (GMM)-based multi-modal hybrid retrieval strategy, which dynamically combines the visual and textual retrieval pipelines for more effective document retrieval.
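
To make the GMM component concrete, here is a minimal sketch of one plausible reading of such a strategy: fit a two-component mixture to a query's similarity scores and keep only the candidates assigned to the higher-scoring component, so the retrieval cutoff adapts per query instead of using a fixed top-k. The two-component assumption, the function names, and the use of scikit-learn are illustrative choices, not ViDoRAG's actual implementation.

```python
# Illustrative sketch: adaptive retrieval cutoff via a 2-component GMM.
# Assumptions (not from the repo): scores are similarity scores already
# produced by separate visual and textual retrievers, and two components
# (relevant vs. noise) are enough to separate them.
import numpy as np
from sklearn.mixture import GaussianMixture

def adaptive_top_k(scores: np.ndarray) -> np.ndarray:
    """Return indices of candidates assigned to the higher-mean component."""
    gmm = GaussianMixture(n_components=2, random_state=0)
    labels = gmm.fit_predict(scores.reshape(-1, 1))
    relevant = int(np.argmax(gmm.means_.ravel()))  # component with the higher mean score
    return np.flatnonzero(labels == relevant)

# Hybrid retrieval: merge per-modality candidate sets instead of a fixed top-k.
visual_scores = np.array([0.82, 0.79, 0.31, 0.28, 0.75, 0.30])
textual_scores = np.array([0.66, 0.21, 0.64, 0.19, 0.18, 0.63])
candidates = np.union1d(adaptive_top_k(visual_scores), adaptive_top_k(textual_scores))
print(candidates)  # indices kept by either modality
```

Merging the per-modality candidate sets this way lets each query lean on whichever pipeline separates relevant pages from noise more cleanly.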

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n vidorag python=3.10), and install the requirements (pip install -r requirements.txt).
  • Prerequisites: Python 3.10 and Git LFS (for the dataset download). The README recommends following the Colpali-engine and Transformers library guidance for compatible dependency versions.
  • Dataset: The ViDoSeek dataset is distributed via Git LFS. Scripts are provided to convert PDFs to page images and optionally apply OCR or VLMs; a sketch of that conversion step follows this list.
  • Resources: Building the index database involves multi-modal embedding, which can be resource-intensive.
  • Links: ViDoRAG GitHub, ViDoSeek Dataset
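
The repository ships its own conversion scripts; as a rough sketch of what the PDF-to-image step involves, the snippet below uses the third-party pdf2image package (an assumption made for illustration, not necessarily what the repo's scripts use).

```python
# Illustrative sketch of the PDF -> page-image preprocessing step.
# Assumption: pdf2image (a wrapper around poppler) stands in for the
# repository's own conversion scripts.
from pathlib import Path
from pdf2image import convert_from_path

def pdf_to_images(pdf_path: str, out_dir: str, dpi: int = 144) -> list[str]:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    saved = []
    for i, page in enumerate(pages):
        path = out / f"page_{i:04d}.png"
        page.save(path)
        saved.append(str(path))
    return saved
```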

Highlighted Details

  • Introduces the ViDoSeek benchmark for visually rich document retrieval-reason-answer tasks.
  • Achieves over 10% improvement on ViDoSeek, establishing a new state-of-the-art.
  • Supports integration of various embedding models for custom retriever creation; a sketch of that interface follows this list.
  • Provides evaluation code for customizing assessment pipelines.
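
As an illustration of the pluggable-embedding point above, the sketch below wires an arbitrary embedding callable into a small cosine-similarity retriever. The interface is hypothetical; the repository defines its own retriever classes, and in practice the callable would wrap a real encoder such as one from Colpali-engine.

```python
# Illustrative sketch: a retriever with a pluggable embedding model.
# The CustomRetriever interface and toy_embed are hypothetical stand-ins.
import numpy as np
from typing import Callable, Sequence

class CustomRetriever:
    def __init__(self, embed: Callable[[Sequence[str]], np.ndarray], docs: Sequence[str]):
        self.embed = embed
        self.docs = list(docs)
        emb = embed(self.docs)
        self.index = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows

    def search(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        q = self.embed([query])[0]
        q = q / np.linalg.norm(q)
        scores = self.index @ q  # cosine similarity on normalized vectors
        top = np.argsort(scores)[::-1][:k]
        return [(self.docs[i], float(scores[i])) for i in top]

# Toy embedding so the sketch runs end to end; replace with a real model.
def toy_embed(texts: Sequence[str]) -> np.ndarray:
    return np.vstack([[hash((t, d)) % 1000 / 1000.0 for d in range(8)] for t in texts])

retriever = CustomRetriever(toy_embed, ["page about revenue", "page about org chart"])
print(retriever.search("revenue", k=1))
```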

Maintenance & Community

The project is associated with Alibaba-NLP. The README does not provide further community or maintenance details.

Licensing & Compatibility

The README does not state a license. The citation points to an arXiv preprint (arXiv:2502.18017), so the code accompanies a research paper. Suitability for commercial or closed-source use is not specified.

Limitations & Caveats

The framework is presented as a research contribution, and its stability, long-term maintenance, and production-readiness are not detailed. Specific hardware requirements for efficient multi-modal embedding are not listed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull requests (30d): 1
  • Issues (30d): 1
  • Star history: 67 stars in the last 90 days
