Awesome_Think_With_Images by zhaochen0110

Survey of "Thinking with Images" in multimodal AI

Created 1 year ago

1,487 stars

Top 26.9% on SourcePulse

Project Summary

This repository curates resources and papers for "Thinking with Images," a paradigm shift in multimodal AI where vision acts as a dynamic cognitive workspace for reasoning, planning, and generation. It targets researchers, developers, and enthusiasts interested in advanced AI capabilities that move beyond static visual perception.

How It Works

The project structures research along a trajectory of increasing cognitive autonomy in Large Vision-Language Models (LVLMs). It categorizes papers into three stages: Tool-Driven Visual Exploration (models orchestrating external visual tools), Programmatic Visual Manipulation (models generating code for custom visual analyses), and Intrinsic Visual Imagination (models generating internal visual representations). This taxonomy provides a systematic overview of the evolving capabilities in multimodal AI.
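The three-stage taxonomy can be pictured as a small data structure. The following is a minimal sketch, not something the repository provides: the `Stage` enum, the `PaperEntry` class, the `group_by_stage` helper, and the placeholder titles are all hypothetical names introduced for illustration, and the stage descriptions simply paraphrase the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    # Hypothetical enum; descriptions paraphrase the three stages named above.
    TOOL_DRIVEN_VISUAL_EXPLORATION = "model orchestrates external visual tools"
    PROGRAMMATIC_VISUAL_MANIPULATION = "model generates code for custom visual analyses"
    INTRINSIC_VISUAL_IMAGINATION = "model generates internal visual representations"


@dataclass(frozen=True)
class PaperEntry:
    title: str    # placeholder titles only, not real papers
    stage: Stage  # the stage the entry is filed under


def group_by_stage(entries):
    """Group entry titles by stage, keeping stages in definition order."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.stage].append(entry.title)
    # Iterating over the Enum yields members in the order they were defined,
    # which matches the trajectory of increasing cognitive autonomy.
    return {stage: grouped[stage] for stage in Stage if stage in grouped}


# Assumed sample data for illustration only.
entries = [
    PaperEntry("Placeholder paper A", Stage.INTRINSIC_VISUAL_IMAGINATION),
    PaperEntry("Placeholder paper B", Stage.TOOL_DRIVEN_VISUAL_EXPLORATION),
    PaperEntry("Placeholder paper C", Stage.TOOL_DRIVEN_VISUAL_EXPLORATION),
]
index = group_by_stage(entries)
```

Ordering the enum members from tool use to intrinsic imagination keeps the grouped index in the same sequence the list uses, regardless of the order in which entries are added.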

Quick Start & Requirements

This repository is a curated list of research papers and does not require installation or execution. It serves as a reference guide.

Highlighted Details

Comprehensive survey paper "Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers" released.
OpenThinkIMG released: an end-to-end open-source framework that lets LVLMs "think with images," with Docker support.
Papers are organized into three stages: Tool-Driven Visual Exploration, Programmatic Visual Manipulation, and Intrinsic Visual Imagination.
Includes a dedicated section for Evaluation & Benchmarks relevant to "Thinking with Images" capabilities.

Maintenance & Community

Contributions are welcome via pull requests. The most recent commit was in July 2025. Citation information for the accompanying survey paper is provided.

Licensing & Compatibility

The repository is licensed under the MIT License, permitting broad use and modification.

Limitations & Caveats

This repository is a curated list of research papers. It does not itself provide executable code or models, although it references the OpenThinkIMG framework. The field is evolving rapidly, so the list may not be exhaustive.

Health Check

Last Commit: 4 months ago

Responsiveness: Inactive

Star History: 10 stars in the last 30 days