OpenThinkIMG by zhaochen0110

Framework for LVLMs to "think with images"

Created 1 year ago

352 stars

Top 79.5% on SourcePulse

Project Summary

OpenThinkIMG is an end-to-end framework enabling Large Vision-Language Models (LVLMs) to interactively "think with images" using a suite of vision tools. It targets researchers and developers seeking to enhance LVLM capabilities in complex visual reasoning and precise interaction tasks, offering a unified platform for tool management and a novel V-ToolRL training method for improved adaptability.

How It Works

OpenThinkIMG standardizes vision tool integration by treating each tool as an independent, modular service. This design promotes scalability and fault isolation. The framework supports both Supervised Fine-Tuning (SFT) and a novel Reinforcement Learning approach, V-ToolRL, which trains agents to dynamically discover optimal tool-usage strategies through interaction and feedback, outperforming static SFT methods.
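The "independent, modular service" idea can be illustrated with a minimal sketch. This is not OpenThinkIMG's actual code: the names ToolRegistry, ToolResult, crop, and broken_ocr are hypothetical, and real tools would run as separate services rather than in-process functions. The point shown is fault isolation: a failing tool returns an error result instead of taking down the caller or the other tools.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolResult:
    """Outcome of one tool call (hypothetical type, for illustration only)."""
    ok: bool
    output: Any = None
    error: str = ""


class ToolRegistry:
    """Hypothetical registry that treats each tool as an isolated unit."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> ToolResult:
        tool = self._tools.get(name)
        if tool is None:
            return ToolResult(ok=False, error=f"unknown tool: {name}")
        try:
            return ToolResult(ok=True, output=tool(**kwargs))
        except Exception as exc:
            # Fault isolation: one failing tool does not affect the others.
            return ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")


def crop(box, width, height):
    """Stand-in for an image-cropping tool: clamp a box to the image bounds."""
    x0, y0, x1, y1 = box
    return (max(0, x0), max(0, y0), min(width, x1), min(height, y1))


def broken_ocr(text_region):
    """Stand-in for a tool whose backend is down."""
    raise RuntimeError("OCR backend unavailable")


registry = ToolRegistry()
registry.register("crop", crop)
registry.register("ocr", broken_ocr)

crop_result = registry.call("crop", box=(-5, 10, 120, 90), width=100, height=80)
ocr_result = registry.call("ocr", text_region=(0, 0, 10, 10))
```

Here the crop call succeeds even though the OCR stand-in fails, which is the behavior the service-per-tool design is meant to provide.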

Quick Start & Requirements

Installation: Clone the repository, create a Conda environment (Python 3.10 recommended), and install PyTorch built for your CUDA version (e.g., CUDA 11.8). Then install the remaining dependencies with pip install -r tool_server_requirements.txt, followed by an editable install of the package with pip install -e . (the trailing dot is part of the command). Training requires the additional dependencies listed in requirements_train.txt.
Prerequisites: PyTorch with CUDA, accelerate library. Specific tools may have individual requirements.
Setup: To launch the vision tool services, edit the configuration file and run python start_server_config.py. Inference and training scripts are started with accelerate launch.
Documentation: https://github.com/zhaochen0110/Tool-Factory/blob/main/README.md (Note: this links directly to the README; other documentation may be located elsewhere in the repository).

Highlighted Details

V-ToolRL significantly outperforms SFT-only and zero-shot approaches on chart reasoning tasks, achieving 59.39% accuracy compared to 45.67% for SFT and even surpassing GPT-4.1 (50.71%).
Supports a diverse set of vision tools including object detection (GroundingDINO), segmentation (SAM), OCR, image cropping, and drawing annotations.
Provides a flexible training pipeline supporting both SFT and V-ToolRL, built upon the OpenR1 framework.
Actively developing pre-trained models, expanding the toolset, and adding support for more LVLM backbones (e.g., LLaVA).
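To put the chart-reasoning figures above in perspective, the short calculation below derives the absolute and relative gaps from the three reported accuracies. Only the three accuracy values come from the summary; the gap helper and variable names are illustrative.

```python
# Reported chart-reasoning accuracies (percent), as quoted in the summary above.
accuracy = {"V-ToolRL": 59.39, "SFT": 45.67, "GPT-4.1": 50.71}


def gap(model, baseline):
    """Return (absolute gap in percentage points, relative gain in percent)."""
    diff = accuracy[model] - accuracy[baseline]
    return round(diff, 2), round(100 * diff / accuracy[baseline], 1)


sft_gap = gap("V-ToolRL", "SFT")
gpt_gap = gap("V-ToolRL", "GPT-4.1")
print(sft_gap, gpt_gap)
```

V-ToolRL leads SFT by 13.72 percentage points (about 30% relative) and GPT-4.1 by 8.68 percentage points (about 17% relative).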

Maintenance & Community

The project is actively developed with recent updates in June 2025. It has received media coverage from Qubit, Deep Learning and NLP, and Machine Learning and NLP. Models and datasets are available on HuggingFace. Contributions are welcomed via pull requests.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

OpenThinkIMG is an alpha release under ongoing development. The core system is functional for replicating the paper's results, but features such as an expanded toolset and broader LVLM support are not yet available. Requirements for specific tools are to be released separately.

Explore Similar Projects

vla-scratch by EGalahad

Large-VLM-based-VLA-for-Robotic-Manipulation by JiuTian-VL

RoboVLMs by Robot-VLAs

agentlego by InternLM

CognitiveKernel-Pro by Tencent

Trinity-RFT by agentscope-ai

ScaleCUA by OpenGVLab

supervisely by supervisely

Nemotron by NVIDIA-NeMo

Agent-R1 by AgentR1

Visual-RFT by Liuziyu77

octotools by octotools