Framework for LVLMs to "think with images"
OpenThinkIMG is an end-to-end framework enabling Large Vision-Language Models (LVLMs) to interactively "think with images" using a suite of vision tools. It targets researchers and developers seeking to enhance LVLM capabilities in complex visual reasoning and precise interaction tasks, offering a unified platform for tool management and a novel V-ToolRL training method for improved adaptability.
How It Works
OpenThinkIMG standardizes vision tool integration by treating each tool as an independent, modular service. This design promotes scalability and fault isolation. The framework supports both Supervised Fine-Tuning (SFT) and a novel Reinforcement Learning approach, V-ToolRL, which trains agents to dynamically discover optimal tool-usage strategies through interaction and feedback, outperforming static SFT methods.
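The modular-service idea above can be illustrated with a minimal sketch. This is not OpenThinkIMG's actual API: the names ToolRegistry, ToolResult, crop, and broken_tool are hypothetical, and plain Python callables stand in for what would be separate tool services. It only shows how a caller can dispatch to tools by name while a failure in one tool stays contained.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolResult:
    """Outcome of one tool call (hypothetical type, for illustration only)."""
    ok: bool
    output: Any = None
    error: str = ""


class ToolRegistry:
    """Hypothetical registry; each callable stands in for an independent tool service."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> ToolResult:
        tool = self._tools.get(name)
        if tool is None:
            return ToolResult(ok=False, error=f"unknown tool: {name}")
        try:
            return ToolResult(ok=True, output=tool(**kwargs))
        except Exception as exc:
            # Fault isolation: a failing tool yields an error result
            # instead of crashing the caller or the other tools.
            return ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")


def crop(box, size):
    """Toy 'crop' tool: clamps a (left, top, right, bottom) box to the image size."""
    left, top, right, bottom = box
    width, height = size
    return (max(0, left), max(0, top), min(width, right), min(height, bottom))


def broken_tool():
    """Toy tool that always fails, to demonstrate isolation."""
    raise RuntimeError("service unavailable")


registry = ToolRegistry()
registry.register("crop", crop)
registry.register("broken", broken_tool)

good = registry.call("crop", box=(-5, 10, 120, 90), size=(100, 80))
bad = registry.call("broken")
print(good)
print(bad)
```

In a real deployment each tool would sit behind its own process or network endpoint; the uniform call-and-result shape is what lets tools be added or fail independently.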
Quick Start & Requirements
Install the dependencies with pip install -r tool_server_requirements.txt and pip install -e . Training requires additional dependencies from requirements_train.txt.
The accelerate library is required. Specific tools may have individual requirements.
Start the server with python start_server_config.py. Inference and training scripts utilize accelerate launch.
Highlighted Details
Maintenance & Community
The project is actively developed with recent updates in June 2025. It has received media coverage from Qubit, Deep Learning and NLP, and Machine Learning and NLP. Models and datasets are available on HuggingFace. Contributions are welcomed via pull requests.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
OpenThinkIMG is an alpha release under ongoing development. The core system is functional for replicating the paper's results, but features such as expanded toolsets and broader LVLM support are still in progress. Requirements for specific tools are to be released separately.