VisRAG by OpenBMB

VLM-based RAG pipeline for multi-modality documents

created 9 months ago
761 stars

Top 46.6% on sourcepulse

View on GitHub
Project Summary

VisRAG is a vision-language model (VLM)-based Retrieval-Augmented Generation (RAG) pipeline for multi-modality documents. It addresses the information loss of traditional text-based RAG by embedding documents directly as images with a VLM, retaining layout and visual information that text extraction discards. The target audience is researchers and developers working on document understanding and VLM applications.

How It Works

VisRAG comprises two main components: VisRAG-Ret for retrieval and VisRAG-Gen for generation. VisRAG-Ret utilizes VLMs like MiniCPM-V 2.0 to embed entire documents as images, bypassing the need for text parsing. This approach preserves rich visual and layout information lost in traditional OCR-based methods. VisRAG-Gen then leverages VLMs (including GPT-4o) to generate responses based on the retrieved visual document representations.
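The retrieval step above can be sketched as nearest-neighbor search over page-image embeddings. This is a minimal illustration only: the function name and the random stand-in vectors are hypothetical, whereas real VisRAG-Ret embeddings come from encoding page screenshots with the VLM.

```python
import numpy as np

def retrieve(query_emb: np.ndarray, page_embs: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the k pages most similar to the query.

    VisRAG-Ret scores query/page pairs in a shared embedding space;
    cosine similarity is used here as a typical choice.
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity per page
    return np.argsort(scores)[::-1][:k]  # highest-scoring pages first

# Stand-in embeddings: 4 document pages, 8-dim vectors.
rng = np.random.default_rng(0)
pages = rng.normal(size=(4, 8))
query = pages[2] + 0.01 * rng.normal(size=8)  # query nearly identical to page 2

top = retrieve(query, pages, k=2)
print(top[0])  # page 2 ranks first
```

The top-k page images returned here would then be passed, as images, to the generator VLM along with the question.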

Quick Start & Requirements

  • Install: Clone the repository, create a Conda environment with Python 3.10.8, install CUDA toolkit (11.8.0), and run pip install -r requirements.txt followed by pip install -e . and pip install -e ./timm_modified.
  • Prerequisites: CUDA 11.8.0, Python 3.10.8.
  • Resources: Training requires a significant dataset (362,110 query–document (Q-D) pairs) and potentially a distributed training setup (a DeepSpeed config is provided).
  • Links: VisRAG Pipeline, Colab Demo, Paper, Hugging Face Models.
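The install steps above look roughly like the following. The repository URL is assumed from the project and owner names; adjust it, the environment name, and the CUDA install method to your setup.

```shell
# Assumed URL based on the OpenBMB/VisRAG project name
git clone https://github.com/OpenBMB/VisRAG.git
cd VisRAG

# Conda environment with the pinned Python version
conda create -n visrag python=3.10.8
conda activate visrag

# Install the CUDA 11.8.0 toolkit for your platform
# (e.g. via conda or the NVIDIA installer) before proceeding.

pip install -r requirements.txt
pip install -e .
pip install -e ./timm_modified   # modified timm fork needed for training
```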

Highlighted Details

  • Parsing-free RAG approach using VLMs for direct image embedding of documents.
  • Supports multiple VLM generators (e.g., MiniCPM-V 2.0, MiniCPM-V 2.6, GPT-4o).
  • Training data includes academic datasets and synthetically generated web-crawled PDF data with VLM-generated queries.
  • Evaluation supports various multi-modal QA datasets like ArxivQA, ChartQA, and PlotQA.


Licensing & Compatibility

  • Code licensed under Apache-2.0.
  • VisRAG-Ret model weights are released under the MiniCPM Model License (Model License.md).
  • Weights are free for academic research; commercial use is also free after registering via a questionnaire.

Limitations & Caveats

  • Training requires the bundled timm_modified fork rather than upstream timm, which may complicate dependency management.
  • Training data requires manual merging and shuffling if using both in-domain and synthetic datasets.
Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 79 stars in the last 90 days
