VisionThink: Efficient Vision-Language Model with Reinforcement Learning
VisionThink is an open-source project focused on developing efficient Vision Language Models (VLMs) through reinforcement learning. It targets researchers and practitioners aiming to improve VLM performance and reduce computational costs, particularly for tasks requiring fine-grained visual understanding and OCR. The core innovation lies in using reinforcement learning to autonomously learn when to reduce visual tokens, leading to significant efficiency gains without sacrificing accuracy.
How It Works
VisionThink employs reinforcement learning to train a VLM to dynamically decide whether to discard visual tokens. This approach allows the model to adapt its processing based on the input, focusing computational resources on relevant visual information. By learning an optimal token reduction strategy, it achieves substantial efficiency improvements, particularly on benchmarks sensitive to visual detail.
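To make the decision-and-reward structure concrete, here is a minimal, hypothetical sketch of the loop described above. It is not the project's implementation: the policy, model, and reward functions are placeholders, and the token-cost weighting is an assumption made purely for illustration.

```python
# Illustrative sketch only: a simplified view of the RL objective described above.
# Names and reward shaping are hypothetical, not taken from the VisionThink codebase.

from dataclasses import dataclass
import random

TOKEN_COST_WEIGHT = 0.1  # assumed trade-off between answer correctness and visual-token usage

@dataclass
class Sample:
    image_tokens_full: int   # number of visual tokens at full resolution
    question: str
    answer: str

def policy_decides_to_reduce(sample: Sample) -> bool:
    """Stand-in for the learned policy: decide whether reduced visual tokens suffice."""
    return random.random() < 0.5  # placeholder; in training this decision comes from the VLM itself

def model_answer(sample: Sample, reduced: bool) -> str:
    """Stand-in for the VLM forward pass with either reduced or full visual tokens."""
    return sample.answer if not reduced or random.random() < 0.8 else "wrong"

def reward(sample: Sample, reduced: bool, prediction: str) -> float:
    """Reward = answer correctness minus a penalty proportional to visual tokens consumed."""
    correct = 1.0 if prediction == sample.answer else 0.0
    tokens_used = sample.image_tokens_full // 4 if reduced else sample.image_tokens_full
    return correct - TOKEN_COST_WEIGHT * tokens_used / sample.image_tokens_full

# One rollout of the decision loop; an RL algorithm (e.g., a policy-gradient method)
# would update the policy to maximize this reward over many samples.
sample = Sample(image_tokens_full=1024, question="What does the sign say?", answer="STOP")
reduced = policy_decides_to_reduce(sample)
pred = model_answer(sample, reduced)
print(reward(sample, reduced, pred))
```

The key point the sketch captures is that the reward couples task accuracy with token cost, so the model is pushed to drop visual tokens only when doing so does not hurt the answer.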
Quick Start & Requirements
Create a conda environment (`conda create -n visionthink python=3.11`), activate it, and install the project in editable mode along with `flash-attn`: `pip3 install -e .`. For the Qwen3 judge model, additionally run `pip install -U tensordict transformers==4.51.0`. The general training data is available as `Senqiao/VisionThink-General-Train` (a Hugging Face dataset). Note that `flash-attn` is required.
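A consolidated setup might look like the following. The repository path is assumed to be dvlab-research/VisionThink, and the `--no-build-isolation` flag is a common requirement when installing `flash-attn`, not something stated in the quick start above.

```bash
# Assumed repository path; adjust if the project is hosted elsewhere.
git clone https://github.com/dvlab-research/VisionThink.git
cd VisionThink

conda create -n visionthink python=3.11
conda activate visionthink

pip3 install -e .
pip3 install flash-attn --no-build-isolation  # flag assumed; commonly needed for flash-attn builds

# Only needed when using the Qwen3 judge model
pip install -U tensordict transformers==4.51.0
```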
Highlighted Details
Maintenance & Community
The project is actively maintained by dvlab-research. It builds upon several other open-source projects, including Verl, EasyR1, and Lmms-Eval.
Licensing & Compatibility
VisionThink is licensed under the Apache License 2.0, which permits commercial use and integration with closed-source projects.
Limitations & Caveats
The project is relatively new, with the paper and repository released in July 2025. While it reports significant improvements, extensive independent benchmarking across diverse tasks would be beneficial. Using GPT as the reward model requires Azure API keys.