OpenThinkIMG by zhaochen0110

Framework for LVLMs to "think with images"

created 8 months ago
293 stars

Top 90.1% on SourcePulse

Project Summary

OpenThinkIMG is an end-to-end framework enabling Large Vision-Language Models (LVLMs) to interactively "think with images" using a suite of vision tools. It targets researchers and developers seeking to enhance LVLM capabilities in complex visual reasoning and precise interaction tasks, offering a unified platform for tool management and a novel V-ToolRL training method for improved adaptability.

How It Works

OpenThinkIMG standardizes vision tool integration by treating each tool as an independent, modular service. This design promotes scalability and fault isolation. The framework supports both Supervised Fine-Tuning (SFT) and a novel Reinforcement Learning approach, V-ToolRL, which trains agents to dynamically discover optimal tool-usage strategies through interaction and feedback, outperforming static SFT methods.
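The idea of modular tools with fault isolation can be illustrated with a minimal sketch. This is not OpenThinkIMG's actual API: ToolRegistry, ToolResult, crop, and broken_ocr are hypothetical names, and the "services" here are plain in-process functions rather than separately deployed servers. The point is only that a failure in one tool is reported as an error result while the other tools keep working.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class ToolResult:
    """Outcome of one tool call (hypothetical structure for illustration)."""
    tool: str
    ok: bool
    output: Any = None
    error: str = ""


class ToolRegistry:
    """Hypothetical in-process stand-in for a set of independent tool services."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> ToolResult:
        if name not in self._tools:
            return ToolResult(name, False, error=f"unknown tool: {name}")
        try:
            return ToolResult(name, True, output=self._tools[name](**kwargs))
        except Exception as exc:
            # Fault isolation: a failing tool yields an error result
            # instead of propagating and taking down the caller.
            return ToolResult(name, False, error=f"{type(exc).__name__}: {exc}")


def crop(box, size):
    """Toy 'crop' tool: clamp a (left, top, right, bottom) box to the image size."""
    left, top, right, bottom = box
    width, height = size
    return (max(0, left), max(0, top), min(width, right), min(height, bottom))


def broken_ocr(image_path):
    """Toy 'OCR' tool that always fails, to demonstrate isolation."""
    raise RuntimeError("OCR backend unavailable")


registry = ToolRegistry()
registry.register("crop", crop)
registry.register("ocr", broken_ocr)

ok = registry.call("crop", box=(-5, 10, 700, 300), size=(640, 480))
failed = registry.call("ocr", image_path="chart.png")
print(ok)
print(failed)
```

In a real deployment each tool would presumably sit behind its own process or network endpoint; the registry above only mirrors the calling pattern, not the transport.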

Quick Start & Requirements

  • Installation: Clone the repository, create a Conda environment (Python 3.10 recommended), install PyTorch built for your CUDA version (e.g., CUDA 11.8), and then run pip install -r tool_server_requirements.txt followed by pip install -e . to install the remaining dependencies. Training requires the additional dependencies listed in requirements_train.txt.
  • Prerequisites: PyTorch with CUDA and the accelerate library. Individual tools may have their own requirements.
  • Setup: To launch the vision tool services, edit the configuration file and run python start_server_config.py. Inference and training scripts are run via accelerate launch.
  • Documentation: https://github.com/zhaochen0110/Tool-Factory/blob/main/README.md (Note: this link points directly to a README; other documentation may be located elsewhere in the repository.)

Highlighted Details

  • On chart reasoning tasks, V-ToolRL reaches 59.39% accuracy, compared with 45.67% for SFT alone (a gain of 13.72 percentage points) and 50.71% for GPT-4.1 (a gain of 8.68 percentage points).
  • Supports a diverse set of vision tools including object detection (GroundingDINO), segmentation (SAM), OCR, image cropping, and drawing annotations.
  • Provides a flexible training pipeline supporting both SFT and V-ToolRL, built upon the OpenR1 framework.
  • Actively developing pre-trained models, expanding the toolset, and adding support for more LVLM backbones (e.g., LLaVA).
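The chart-reasoning figures in the first bullet can be turned into explicit gaps with a few lines of arithmetic. The sketch below uses only the three accuracy values quoted in this summary; the helper names gap_in_points and relative_gain are illustrative and not part of the project.

```python
# Accuracy figures (percent) as quoted in the summary above.
accuracy = {"V-ToolRL": 59.39, "SFT": 45.67, "GPT-4.1": 50.71}


def gap_in_points(a: str, b: str) -> float:
    """Absolute difference between two methods, in percentage points."""
    return round(accuracy[a] - accuracy[b], 2)


def relative_gain(a: str, b: str) -> float:
    """Improvement of method a over method b, as a percentage of b's accuracy."""
    return round(100 * (accuracy[a] - accuracy[b]) / accuracy[b], 1)


for baseline in ("SFT", "GPT-4.1"):
    print(
        f"V-ToolRL vs {baseline}: "
        f"+{gap_in_points('V-ToolRL', baseline)} points, "
        f"+{relative_gain('V-ToolRL', baseline)}% relative"
    )
```

Percentage points and relative gain answer different questions, so it helps to state which one is meant when quoting the improvement.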

Maintenance & Community

The project is actively developed with recent updates in June 2025. It has received media coverage from Qubit, Deep Learning and NLP, and Machine Learning and NLP. Models and datasets are available on HuggingFace. Contributions are welcomed via pull requests.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

OpenThinkIMG is an alpha release under ongoing development. The core system is functional for replicating the paper's results, but features such as an expanded toolset and broader LVLM support are still in progress. Requirements for specific tools are to be released separately.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

30 stars in the last 30 days
