Multimodal-RAG-Survey by llm-lab-org

Survey of multimodal retrieval-augmented generation

Created 7 months ago
353 stars

Top 79.0% on SourcePulse

View on GitHub
Project Summary

This repository provides a comprehensive survey of Multimodal Retrieval-Augmented Generation (RAG), cataloging research papers, datasets, and methodologies. It serves as a resource for researchers and practitioners who want to understand and advance RAG systems that integrate multiple data modalities such as text, images, audio, and video.

How It Works

The survey categorizes Multimodal RAG systems along three axes: pipeline structure, taxonomy of advances, and application domain. It reviews retrieval strategies, multimodal encoders, modality-specific retrieval techniques (text, vision, video, audio), document understanding, re-ranking, filtering, fusion mechanisms, augmentation techniques, and generation strategies. This structure supports a close look at the challenges of cross-modal information integration for generative AI.
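To make that pipeline concrete, below is a minimal Python sketch of the generic flow the survey describes (encode, retrieve, re-rank/filter, fuse, generate). It is an illustrative toy under assumed interfaces, not code from this repository or from any surveyed system: the `Document` class, the toy embeddings, and the stand-in generator are all hypothetical.

```python
# Minimal sketch of a generic multimodal RAG pipeline:
# encode -> retrieve -> re-rank/filter -> fuse -> generate.
# All components and the toy corpus are hypothetical stand-ins.

from dataclasses import dataclass
from math import sqrt


@dataclass
class Document:
    content: str            # text snippet or image/video/audio caption
    modality: str           # "text", "image", "video", "audio"
    embedding: list[float]  # assumed output of a shared multimodal encoder


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_emb: list[float], corpus: list[Document], k: int = 3) -> list[Document]:
    # Dense retrieval: rank all documents by similarity to the query embedding.
    return sorted(corpus, key=lambda d: cosine(query_emb, d.embedding), reverse=True)[:k]


def rerank_and_filter(query_emb: list[float], candidates: list[Document],
                      threshold: float = 0.2) -> list[Document]:
    # Re-ranking / filtering stage: drop weakly related candidates.
    return [d for d in candidates if cosine(query_emb, d.embedding) >= threshold]


def fuse(query: str, evidence: list[Document]) -> str:
    # Fusion / augmentation: concatenate retrieved multimodal evidence into the prompt.
    context = "\n".join(f"[{d.modality}] {d.content}" for d in evidence)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate_answer(prompt: str) -> str:
    # In a real system the prompt would go to a multimodal LLM; here we just echo its size.
    return f"(model output for prompt of {len(prompt)} characters)"


if __name__ == "__main__":
    corpus = [
        Document("A cat sitting on a windowsill.", "image", [0.9, 0.1, 0.0]),
        Document("Cats are small domesticated felines.", "text", [0.8, 0.2, 0.1]),
        Document("Stock market closes higher today.", "text", [0.0, 0.1, 0.9]),
    ]
    query_embedding = [0.85, 0.15, 0.05]  # pretend query embedding
    hits = rerank_and_filter(query_embedding, retrieve(query_embedding, corpus))
    print(generate_answer(fuse("What animal is in the picture?", hits)))
```

In this toy run the unrelated finance snippet falls below the similarity threshold and is filtered out before fusion, which is the role the survey assigns to the re-ranking/filtering stage.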

Quick Start & Requirements

This repository is primarily a curated list of research papers and datasets; there is no software component to install or run. Users are expected to access the linked papers and datasets for their own research and development.

Highlighted Details

  • Extensive taxonomy covering retrieval strategies, multimodal encoders, modality-centric retrieval, re-ranking, fusion, augmentation, and generation techniques.
  • Detailed overview of popular datasets across image-text, video-text, audio-text, medical, fashion, QA, and other domains, with statistics and links.
  • Comprehensive list of related survey papers and specific research papers categorized by their contribution to Multimodal RAG.
  • Discussion of the tasks addressed by Multimodal RAG and the evaluation metrics used to assess them (a toy retrieval-metric sketch follows this list).
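As one concrete example of a retrieval metric commonly used when evaluating the retrieval stage of RAG systems, the sketch below computes Recall@K over toy data. The queries, IDs, and relevance judgments are hypothetical and are not taken from the repository or the survey.

```python
# Hypothetical illustration of Recall@K: the fraction of relevant items
# that appear among the top-k retrieved results. Toy data only.

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant items found in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


if __name__ == "__main__":
    retrieved = ["img_12", "txt_03", "img_07", "vid_01"]  # ranked retrieval output
    relevant = {"img_12", "vid_01"}                       # ground-truth relevant items
    for k in (1, 2, 4):
        print(f"Recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```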

Maintenance & Community

The repository is actively maintained, with the survey paper and the curated lists updated to track the rapid growth of the field. The latest version of the paper is available on arXiv and has been accepted to ACL 2025 Findings. Contact information for inquiries is provided.

Licensing & Compatibility

No software license is specified for the repository. The content is presented as a survey and resource list, implying it is intended for informational and research use. Suitability for commercial use depends on the licenses of the individual papers and datasets referenced.

Limitations & Caveats

The README states that it is a work in progress and will be completed soon, so the content may see further additions and refinements. While comprehensive, the repository does not provide executable code or models; users must engage with the external resources it links.

Health Check

Last Commit: 4 weeks ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0

Star History

52 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

463 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago