Survey paper on large multimodal reasoning models
This repository provides a comprehensive survey of Large Multimodal Reasoning Models (LMRMs), detailing their evolution from modular systems to sophisticated language-centric frameworks. It targets researchers and practitioners in AI, offering a structured overview of LMRMs' capabilities, datasets, benchmarks, and future directions, particularly towards native multimodal reasoning.
How It Works
The survey categorizes LMRMs into three stages: perception-driven reasoning (modular networks, vision-language models), language-centric short reasoning (prompt-based, structural, externally augmented), and language-centric long reasoning (cross-modal, MM-O1, MM-R1). It emphasizes the progression towards "native" LMRMs capable of agentic, omni-modal understanding and generative reasoning.
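Purely as an illustration of the taxonomy described above (not code from the repository), a minimal Python sketch of the three-stage categorization might look like the following; the stage and paradigm names mirror the survey's terminology, while the data structure and helper function are hypothetical conveniences:

```python
# Hypothetical sketch of the survey's three-stage LMRM taxonomy;
# the dictionary layout is illustrative, not part of the repository.
LMRM_TAXONOMY = {
    "Stage 1: Perception-Driven Reasoning": [
        "modular reasoning networks",
        "vision-language models",
    ],
    "Stage 2: Language-Centric Short Reasoning": [
        "prompt-based reasoning",
        "structural reasoning",
        "externally augmented reasoning",
    ],
    "Stage 3: Language-Centric Long Reasoning": [
        "cross-modal reasoning",
        "MM-O1 (o1-style multimodal models)",
        "MM-R1 (R1-style multimodal models)",
    ],
}

def list_paradigms(stage: str) -> list[str]:
    """Return the reasoning paradigms grouped under a given stage."""
    return LMRM_TAXONOMY.get(stage, [])

if __name__ == "__main__":
    for stage, paradigms in LMRM_TAXONOMY.items():
        print(stage)
        for paradigm in paradigms:
            print(f"  - {paradigm}")
```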
Quick Start & Requirements
This is a survey repository, not a runnable codebase. It links to numerous research papers and datasets.
Highlighted Details
Maintenance & Community
The repository is maintained by the HITsz-TMG group, with regular updates based on community contributions via issues or email. Contact information for contributors is provided.
Licensing & Compatibility
The repository primarily serves as a curated list of research papers and datasets; consult the repository itself for its license, and note that each linked paper, dataset, and model is governed by its own licensing terms.
Limitations & Caveats
As a survey, it does not provide executable code. Given the rapid pace of LMRM development, some entries may become dated quickly, though the repository aims for continuous updates.