RAG framework for visual documents, using dynamic iterative reasoning agents
ViDoRAG is a novel Retrieval-Augmented Generation (RAG) framework designed for complex reasoning over visually rich documents. It targets researchers and developers working with document understanding and question-answering systems, offering improved noise robustness and state-of-the-art performance on visually complex datasets.
How It Works
ViDoRAG employs a multi-agent, actor-critic paradigm for iterative reasoning, enhancing robustness against noisy inputs. A key innovation is its Gaussian Mixture Model (GMM)-based multi-modal hybrid retrieval strategy, which dynamically integrates visual and textual information pipelines for more effective document retrieval.
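The summary does not say how the mixture model is applied, so the following is only a minimal sketch of one plausible reading: a two-component 1D Gaussian mixture is fitted to each pipeline's similarity scores, candidates that most likely belong to the higher-scoring component are kept, and the visual and textual results are merged by union. The function names, the 0.5 posterior threshold, the union merge, and the sample scores are all assumptions for illustration and are not taken from the ViDoRAG code.

```python
import math


def fit_two_gaussians(scores, iters=100):
    """Fit a 2-component 1D Gaussian mixture with plain EM.

    Returns the component means and per-score responsibilities.
    Assumes the scores are not all identical.
    """
    lo, hi = min(scores), max(scores)
    mu = [lo, hi]                      # start one component at each extreme
    var = [((hi - lo) / 2) ** 2] * 2
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior of each component, computed in log space
        resp = []
        for s in scores:
            logp = [
                math.log(weight[k])
                - 0.5 * math.log(2 * math.pi * var[k])
                - (s - mu[k]) ** 2 / (2 * var[k])
                for k in range(2)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = max(nk / len(scores), 1e-12)
            if nk > 1e-12:
                mu[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
                spread = sum(r[k] * (s - mu[k]) ** 2 for r, s in zip(resp, scores)) / nk
                var[k] = max(spread, 1e-6)  # floor avoids a collapsed component
    return mu, resp


def select_relevant(scored):
    """Keep candidates assigned to the higher-mean component (hypothetical rule)."""
    if len(set(scored.values())) < 2:
        return set(scored)             # nothing to separate; keep everything
    ids = list(scored)
    mu, resp = fit_two_gaussians([scored[i] for i in ids])
    high = 0 if mu[0] > mu[1] else 1
    return {i for i, r in zip(ids, resp) if r[high] > 0.5}


def hybrid_retrieve(text_scores, visual_scores):
    """Union of the candidates each pipeline considers relevant (assumed merge)."""
    return sorted(select_relevant(text_scores) | select_relevant(visual_scores))


# Made-up similarity scores for five pages, one dict per pipeline
text_scores = {"p1": 0.91, "p2": 0.88, "p3": 0.22, "p4": 0.18, "p5": 0.15}
visual_scores = {"p1": 0.30, "p2": 0.85, "p3": 0.80, "p4": 0.28, "p5": 0.25}
print(hybrid_retrieve(text_scores, visual_scores))
```

Under these assumptions the number of pages kept per query is not fixed in advance: it depends on how the scores cluster, which is one way a retrieval cutoff can adapt dynamically to each query.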
Quick Start & Requirements
Create a conda environment (conda create -n vidorag python=3.10), and install requirements (pip install -r requirements.txt).
Highlighted Details
Maintenance & Community
The project is associated with Alibaba-NLP. The README gives no further community or maintenance details.
Licensing & Compatibility
The README does not explicitly state a license. The citation lists the journal as "arXiv preprint arXiv:2502.18017", suggesting it is a research paper. Compatibility for commercial or closed-source use is not specified.
Limitations & Caveats
The framework is presented as a research contribution, and its stability, long-term maintenance, and production-readiness are not detailed. Specific hardware requirements for efficient multi-modal embedding are not listed.