Multimodal-RAG-Survey by llm-lab-org

Survey of multimodal retrieval-augmented generation

Created 7 months ago
353 stars

Top 79.0% on SourcePulse

View on GitHub
Project Summary

This repository provides a comprehensive survey of Multimodal Retrieval-Augmented Generation (RAG), cataloging research papers, datasets, and methodologies. It serves as a resource for researchers and practitioners who want to understand and advance RAG systems that integrate multiple data modalities such as text, images, audio, and video.

How It Works

The survey categorizes Multimodal RAG systems along three axes: pipeline structure, taxonomy of advances, and application domain. It reviews retrieval strategies, multimodal encoders, modality-specific retrieval techniques (text, vision, video, audio), document understanding, re-ranking, filtering, fusion mechanisms, augmentation techniques, and generation strategies. This structure supports a close look at the challenges of cross-modal information integration for generative AI.
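To make that pipeline concrete, below is a minimal Python sketch of the generic flow the survey describes (encode, retrieve, re-rank/filter, fuse, generate). It is an illustrative toy under assumed interfaces, not code from this repository or from any surveyed system: the `Document` class, the toy embeddings, and the stand-in generator are all hypothetical.

```python
# Minimal sketch of a generic multimodal RAG pipeline:
# encode -> retrieve -> re-rank/filter -> fuse -> generate.
# All components and the toy corpus are hypothetical stand-ins.

from dataclasses import dataclass
from math import sqrt


@dataclass
class Document:
    content: str            # text snippet or image/video/audio caption
    modality: str           # "text", "image", "video", "audio"
    embedding: list[float]  # assumed output of a shared multimodal encoder


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_emb: list[float], corpus: list[Document], k: int = 3) -> list[Document]:
    # Dense retrieval: rank all documents by similarity to the query embedding.
    return sorted(corpus, key=lambda d: cosine(query_emb, d.embedding), reverse=True)[:k]


def rerank_and_filter(query_emb: list[float], candidates: list[Document],
                      threshold: float = 0.2) -> list[Document]:
    # Re-ranking / filtering stage: drop weakly related candidates.
    return [d for d in candidates if cosine(query_emb, d.embedding) >= threshold]


def fuse(query: str, evidence: list[Document]) -> str:
    # Fusion / augmentation: concatenate retrieved multimodal evidence into the prompt.
    context = "\n".join(f"[{d.modality}] {d.content}" for d in evidence)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate_answer(prompt: str) -> str:
    # In a real system the prompt would go to a multimodal LLM; here we just echo its size.
    return f"(model output for prompt of {len(prompt)} characters)"


if __name__ == "__main__":
    corpus = [
        Document("A cat sitting on a windowsill.", "image", [0.9, 0.1, 0.0]),
        Document("Cats are small domesticated felines.", "text", [0.8, 0.2, 0.1]),
        Document("Stock market closes higher today.", "text", [0.0, 0.1, 0.9]),
    ]
    query_embedding = [0.85, 0.15, 0.05]  # pretend query embedding
    hits = rerank_and_filter(query_embedding, retrieve(query_embedding, corpus))
    print(generate_answer(fuse("What animal is in the picture?", hits)))
```

In this toy run the unrelated finance snippet falls below the similarity threshold and is filtered out before fusion, which is the role the survey assigns to the re-ranking/filtering stage.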

Quick Start & Requirements

This repository is primarily a curated list of research papers and datasets; there is no software component to install or run. Users are expected to access the linked papers and datasets for their own research and development.

Highlighted Details

  • Extensive taxonomy covering retrieval strategies, multimodal encoders, modality-centric retrieval, re-ranking, fusion, augmentation, and generation techniques.
  • Detailed overview of popular datasets across image-text, video-text, audio-text, medical, fashion, QA, and other domains, with statistics and links.
  • Comprehensive list of related survey papers and specific research papers categorized by their contribution to Multimodal RAG.
  • Discussion of the tasks addressed by Multimodal RAG and the evaluation metrics used to assess them (a toy retrieval-metric sketch follows this list).
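As one concrete example of a retrieval metric commonly used when evaluating the retrieval stage of RAG systems, the sketch below computes Recall@K over toy data. The queries, IDs, and relevance judgments are hypothetical and are not taken from the repository or the survey.

```python
# Hypothetical illustration of Recall@K: the fraction of relevant items
# that appear among the top-k retrieved results. Toy data only.

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant items found in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


if __name__ == "__main__":
    retrieved = ["img_12", "txt_03", "img_07", "vid_01"]  # ranked retrieval output
    relevant = {"img_12", "vid_01"}                       # ground-truth relevant items
    for k in (1, 2, 4):
        print(f"Recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```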

Maintenance & Community

The repository is actively maintained, with the survey paper and the curated lists updated to track the rapid growth of the field. The latest version of the paper is available on arXiv and has been accepted to ACL 2025 Findings. Contact information for inquiries is provided.

Licensing & Compatibility

No software license is specified for the repository. The content is presented as a survey and resource list, implying it is intended for informational and research use. Suitability for commercial use depends on the licenses of the individual papers and datasets referenced.

Limitations & Caveats

The README states that it is a work in progress and will be completed soon, so the content may see further additions and refinements. While comprehensive, the repository does not provide executable code or models; users must engage with the external resources it links.

Health Check

Last Commit: 4 weeks ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0

Star History

52 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago