VisRAG by OpenBMB

VLM-based RAG pipeline for multi-modality documents

created 9 months ago
761 stars

Top 46.6% on sourcepulse

View on GitHub
Project Summary

VisRAG is a vision-language model (VLM)-based Retrieval-Augmented Generation (RAG) pipeline for multi-modality documents. It addresses the information loss of traditional text-based RAG by embedding documents directly as images with a VLM, retaining layout and visual information that text extraction discards. The target audience is researchers and developers working on document understanding and VLM applications.

How It Works

VisRAG comprises two main components: VisRAG-Ret for retrieval and VisRAG-Gen for generation. VisRAG-Ret utilizes VLMs like MiniCPM-V 2.0 to embed entire documents as images, bypassing the need for text parsing. This approach preserves rich visual and layout information lost in traditional OCR-based methods. VisRAG-Gen then leverages VLMs (including GPT-4o) to generate responses based on the retrieved visual document representations.
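The retrieval step above can be sketched as nearest-neighbor search over page-image embeddings. This is a minimal illustration only: the function name and the random stand-in vectors are hypothetical, whereas real VisRAG-Ret embeddings come from encoding page screenshots with the VLM.

```python
import numpy as np

def retrieve(query_emb: np.ndarray, page_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k pages most similar to the query.

    VisRAG-Ret scores query/page pairs in a shared embedding space;
    cosine similarity is used here as a typical choice.
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per page
    return np.argsort(scores)[::-1][:k]  # highest-scoring pages first

# Stand-in embeddings: 4 document pages, 8-dim vectors.
rng = np.random.default_rng(0)
pages = rng.normal(size=(4, 8))
query = pages[2] + 0.01 * rng.normal(size=8)  # query nearly identical to page 2

top = retrieve(query, pages, k=2)
print(top[0])  # page 2 ranks first
```

The top-k page images returned here would then be passed, as images, to the generator VLM along with the question.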

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment with Python 3.10.8, install CUDA toolkit (11.8.0), and run pip install -r requirements.txt followed by pip install -e . and pip install -e ./timm_modified.
  • Prerequisites: CUDA 11.8.0, Python 3.10.8.
  • Resources: Training requires a significant dataset (362,110 query–document (Q-D) pairs) and potentially a distributed training setup (a DeepSpeed config is provided).
  • Links: VisRAG Pipeline, Colab Demo, Paper, Hugging Face Models.
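The install steps above look roughly like the following. The repository URL is assumed from the project and owner names; adjust it, the environment name, and the CUDA install method to your setup.

```shell
# Assumed URL based on the OpenBMB/VisRAG project name
git clone https://github.com/OpenBMB/VisRAG.git
cd VisRAG

# Conda environment with the pinned Python version
conda create -n visrag python=3.10.8
conda activate visrag

# Install the CUDA 11.8.0 toolkit for your platform
# (e.g. via conda or the NVIDIA installer) before proceeding.

pip install -r requirements.txt
pip install -e .
pip install -e ./timm_modified   # modified timm fork needed for training
```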

Highlighted Details

  • Parsing-free RAG approach using VLMs for direct image embedding of documents.
  • Supports multiple VLM generators (e.g., MiniCPM-V 2.0, MiniCPM-V 2.6, GPT-4o).
  • Training data includes academic datasets and synthetically generated web-crawled PDF data with VLM-generated queries.
  • Evaluation supports various multi-modal QA datasets like ArxivQA, ChartQA, and PlotQA.


Licensing & Compatibility

  • Code licensed under Apache-2.0.
  • VisRAG-Ret model weights are released under the MiniCPM Model License (Model License.md).
  • Weights are free for academic research; commercial use is also free after registering via a questionnaire.

Limitations & Caveats

  • Training requires the bundled timm_modified fork rather than upstream timm, which may complicate dependency management.
  • Training data requires manual merging and shuffling if using both in-domain and synthetic datasets.
Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 79 stars in the last 90 days
