VLM-based RAG pipeline for multi-modality documents
VisRAG is a novel vision-language model (VLM)-based Retrieval-Augmented Generation (RAG) pipeline designed for multi-modality documents. It addresses information loss in traditional text-based RAG by directly embedding documents as images using VLMs, enabling more comprehensive data utilization. The target audience includes researchers and developers working with document understanding and VLM applications.
How It Works
VisRAG comprises two main components: VisRAG-Ret for retrieval and VisRAG-Gen for generation. VisRAG-Ret utilizes VLMs like MiniCPM-V 2.0 to embed entire documents as images, bypassing the need for text parsing. This approach preserves rich visual and layout information lost in traditional OCR-based methods. VisRAG-Gen then leverages VLMs (including GPT-4o) to generate responses based on the retrieved visual document representations.
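The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `embed_query`, `embed_page`, and `generate` are hypothetical stand-ins for the VisRAG-Ret and VisRAG-Gen models, and ranking by cosine similarity is an assumption made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: Vector, page_vecs: List[Vector], k: int = 3) -> List[Tuple[int, float]]:
    """Return (page index, score) for the k pages most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(page_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def answer(
    query: str,
    pages: list,
    embed_query: Callable[[str], Vector],   # hypothetical: stands in for the retriever's query encoder
    embed_page: Callable[[object], Vector],  # hypothetical: stands in for the retriever's page-image encoder
    generate: Callable[[str, list], str],    # hypothetical: stands in for the generator VLM
    k: int = 3,
) -> str:
    """Embed page images directly (no OCR step), retrieve the top k, then generate."""
    page_vecs = [embed_page(page) for page in pages]
    top = retrieve(embed_query(query), page_vecs, k)
    return generate(query, [pages[i] for i, _ in top])


if __name__ == "__main__":
    # Toy vectors in place of real embeddings, purely to exercise the ranking.
    toy_page_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for index, score in retrieve([1.0, 0.0], toy_page_vecs, k=2):
        print(index, round(score, 3))
```

The point of the sketch is that pages enter the pipeline as images at both stages: the retriever scores page images against the query, and the generator receives the retrieved images rather than parsed text.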
Quick Start & Requirements
Install the dependencies with pip install -r requirements.txt, followed by pip install -e . and pip install -e ./timm_modified.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The timm_modified library bundled with the repository is an enhanced version required for training, so it is an additional dependency to install and manage.
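One way to keep track of such a dependency is to check where a package resolves from in the active environment. The helper below is a small sketch using only the standard library; the import name `timm` in the usage line is an assumption, since the source does not state the name under which timm_modified is imported.

```python
from importlib import metadata, util


def describe_package(name: str) -> dict:
    """Report whether a top-level package is importable, its file location, and its version."""
    spec = util.find_spec(name)
    if spec is None:
        return {"name": name, "installed": False, "location": None, "version": None}
    try:
        # Looks up the distribution with the same name; it may differ from the import name.
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    return {"name": name, "installed": True, "location": spec.origin, "version": version}


if __name__ == "__main__":
    # "timm" is an assumed import name for the modified library.
    print(describe_package("timm"))
```

If the reported location points inside the repository checkout, the editable install is the one being picked up rather than a copy installed from elsewhere.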