zhengxuJosh
Survey and resource hub for multimodal spatial reasoning
Top 97.7% on SourcePulse
This repository is a curated collection of recent research papers on spatial reasoning in multimodal vision-language models (MVLMs). It organizes work in this subfield by topic, giving researchers and practitioners a reference point for current methods and benchmarks.
How It Works
The project functions as a living survey: it categorizes and links to both seminal and recent papers across the main areas of multimodal spatial reasoning. Coverage includes 2D and 3D spatial understanding, scene and layout reasoning, visual question answering, and embodied AI tasks such as vision-language navigation. The survey also includes work on additional input modalities such as audio and egocentric video.
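For readers who want to work with a categorized list like this programmatically, the following is a minimal sketch of grouping paper links by section heading. It assumes the README uses Markdown headings and `[title](url)` links, which is common for curated lists but is not confirmed here. The sample text, the function name `group_links_by_heading`, and the `example.org` URLs are hypothetical placeholders, not content from the repository.

```python
import re

# Hypothetical README excerpt; headings and entries are placeholders,
# not actual content from the repository.
SAMPLE_README = """\
## 3D Vision
- [Example Paper A](https://example.org/a)
- [Example Paper B](https://example.org/b)

## Embodied AI
- [Example Paper C](https://example.org/c)
"""

# A Markdown ATX heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")
# An inline Markdown link: [title](url)
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def group_links_by_heading(markdown_text):
    """Return {heading: [(title, url), ...]} for links found under each heading."""
    groups = {}
    current = "Uncategorized"  # used for links that appear before any heading
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1)
            continue
        links = LINK.findall(line)
        if links:
            groups.setdefault(current, []).extend(links)
    return groups


if __name__ == "__main__":
    for heading, links in group_links_by_heading(SAMPLE_README).items():
        print(f"{heading}: {len(links)} paper(s)")
```

To apply this to a real list, one would replace `SAMPLE_README` with the text of the downloaded README file and adjust the patterns if the list uses a different entry format.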
Quick Start & Requirements
This repository is a curated list of research papers, not a runnable software project, so there is nothing to install and no hardware requirement for using the list. To contribute new papers or resources, open a GitHub pull request. The README links to related surveys and benchmarks and is organized by research area (3D Vision, Embodied AI, General MLLM, Sound/Audio/Egocentric, Spatial Benchmark).
Highlighted Details
Maintenance & Community
The project is maintained through community contributions via GitHub pull requests. The README does not mention dedicated community channels (such as Discord or Slack) or a roadmap.
Licensing & Compatibility
The project is licensed under the MIT License, which permits use, modification, and distribution, including for commercial purposes, provided the copyright and license notice are retained in copies.
Limitations & Caveats
As a curated list, the repository is only as complete as its community contributions make it. It covers research papers and benchmarks and does not provide a unified software framework or pre-trained models. Because the state of the art moves quickly, how current the survey is depends on ongoing updates.