Survey of "Thinking with Images" in multimodal AI
This repository curates resources and papers for "Thinking with Images," a paradigm shift in multimodal AI where vision acts as a dynamic cognitive workspace for reasoning, planning, and generation. It targets researchers, developers, and enthusiasts interested in advanced AI capabilities that move beyond static visual perception.
How It Works
The project structures research along a trajectory of increasing cognitive autonomy in Large Vision-Language Models (LVLMs). It categorizes papers into three stages: Tool-Driven Visual Exploration (models orchestrating external visual tools), Programmatic Visual Manipulation (models generating code for custom visual analyses), and Intrinsic Visual Imagination (models generating internal visual representations). This taxonomy provides a systematic overview of the evolving capabilities in multimodal AI.
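As a rough illustration of the three-stage taxonomy described above, the sketch below models the stages as an ordered enumeration and groups list entries by stage. It is not part of the repository: the names Stage, Entry, and group_by_stage are hypothetical, and the example titles are placeholders rather than real papers.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """The three stages of the taxonomy, ordered by increasing cognitive autonomy."""
    TOOL_DRIVEN_EXPLORATION = 1      # model orchestrates external visual tools
    PROGRAMMATIC_MANIPULATION = 2    # model generates code for custom visual analyses
    INTRINSIC_IMAGINATION = 3        # model generates internal visual representations


@dataclass(frozen=True)
class Entry:
    """A hypothetical list entry; the repository's real format may differ."""
    title: str
    stage: Stage


def group_by_stage(entries):
    """Return a dict mapping every Stage (in order) to the titles filed under it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.stage].append(entry.title)
    # Iterate over Stage so the result is ordered and includes empty stages.
    return {stage: grouped[stage] for stage in Stage}


if __name__ == "__main__":
    demo = [
        Entry("Placeholder paper A", Stage.INTRINSIC_IMAGINATION),
        Entry("Placeholder paper B", Stage.TOOL_DRIVEN_EXPLORATION),
    ]
    for stage, titles in group_by_stage(demo).items():
        print(stage.name, titles)
```

Using an enumeration with integer values keeps the stages in the order the taxonomy gives them, so a listing built this way reads from least to most autonomous.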
Quick Start & Requirements
This repository is a curated list of research papers and does not require installation or execution. It serves as a reference guide.
Maintenance & Community
Contributions are welcome via pull requests. The repository is actively maintained, with the last commit in July 2025. Citation information for the accompanying survey paper is provided.
Licensing & Compatibility
The repository is licensed under the MIT License, permitting broad use and modification.
Limitations & Caveats
Apart from referencing the OpenThinkIMG framework, the repository provides no executable code or models of its own. The field is evolving rapidly, so the list may not be exhaustive.