VisionThink by dvlab-research

Vision-Language Model with Reinforcement Learning

created 2 weeks ago

337 stars

Top 82.8% on sourcepulse

View on GitHub
Project Summary

VisionThink is an open-source project focused on developing efficient Vision Language Models (VLMs) through reinforcement learning. It targets researchers and practitioners aiming to improve VLM performance and reduce computational costs, particularly for tasks requiring fine-grained visual understanding and OCR. The core innovation lies in using reinforcement learning to autonomously learn when to reduce visual tokens, leading to significant efficiency gains without sacrificing accuracy.

How It Works

VisionThink employs reinforcement learning to train a VLM to dynamically decide whether to discard visual tokens. This approach allows the model to adapt its processing based on the input, focusing computational resources on relevant visual information. By learning an optimal token reduction strategy, it achieves substantial efficiency improvements, particularly on benchmarks sensitive to visual detail.
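The decision loop described above can be sketched in a few lines. This is a hypothetical illustration, not VisionThink's actual API: the function names, the policy interface, and the 4x reduction factor are all assumptions made for clarity.

```python
# Hypothetical sketch of the adaptive visual-token loop; names, the policy
# interface, and the 4x reduction factor are illustrative assumptions.

def answer_with_adaptive_tokens(question, image_tokens, policy, model):
    """Answer using a reduced set of visual tokens first; fall back to the
    full-resolution tokens only when the RL-trained policy decides the
    cheap pass is insufficient."""
    reduced = image_tokens[::4]           # keep 1 in 4 visual tokens
    draft = model(question, reduced)
    if policy(question, reduced, draft) == "sufficient":
        return draft                      # cheap path: far fewer tokens
    return model(question, image_tokens)  # expensive path: full detail


# Toy stand-ins so the sketch runs end to end.
def toy_model(question, tokens):
    return f"answer-from-{len(tokens)}-tokens"

def toy_policy(question, tokens, draft):
    # A real policy is a learned network; here: accept if >= 2 tokens remain.
    return "sufficient" if len(tokens) >= 2 else "insufficient"

print(answer_with_adaptive_tokens("What is shown?", list(range(8)),
                                  toy_policy, toy_model))
```

The key design point the project makes is that the fallback decision itself is learned with reinforcement learning, so easy inputs stay on the cheap path while OCR-heavy or fine-grained inputs trigger full-resolution processing.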

Quick Start & Requirements

  • Installation: Clone the repository, create and activate a conda environment (conda create -n visionthink python=3.11), then install dependencies with pip3 install -e . flash-attn. Using Qwen3 as the judge model requires additional packages: pip install -U tensordict transformers==4.51.0.
  • Data: Download datasets from Hugging Face (e.g., Senqiao/VisionThink-General-Train).
  • Prerequisites: Python 3.11, CUDA, flash-attn.
  • Resources: Requires significant computational resources for training and evaluation, including multiple GPUs.
  • Documentation: Paper
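Taken together, the installation bullets above amount to roughly the following shell session. The repository URL is an assumption based on the project name; the environment name, Python version, and package versions come from the steps listed above.

```shell
# Create and activate the conda environment (Python 3.11 per the prerequisites)
conda create -n visionthink python=3.11
conda activate visionthink

# Clone the repository (URL assumed from the project name) and install
# the project in editable mode together with flash-attn
git clone https://github.com/dvlab-research/VisionThink.git
cd VisionThink
pip3 install -e . flash-attn

# Extra dependencies needed only when using Qwen3 as the judge model
pip install -U tensordict transformers==4.51.0
```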

Highlighted Details

  • Achieves 102% of original model performance on General VQA tasks while reducing visual tokens by 50%.
  • Demonstrates significant improvements on OCR-related tasks.
  • Supports both GPT-4o and Qwen3 as reward models for training.
  • Evaluation framework based on Lmms-Eval.

Maintenance & Community

The project is actively maintained by dvlab-research. It builds upon several other open-source projects, including Verl, EasyR1, and Lmms-Eval.

Licensing & Compatibility

VisionThink is licensed under the Apache License 2.0, which permits commercial use and integration with closed-source projects.

Limitations & Caveats

The project is relatively new, with the paper and repository released in July 2025. While it claims significant improvements, extensive independent benchmarking across diverse tasks would be beneficial. The setup for using GPT as a reward model requires specific Azure API keys.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 338 stars in the last 90 days
