VisionThink: Efficient Vision-Language Model with Reinforcement Learning
VisionThink is an open-source project focused on developing efficient Vision Language Models (VLMs) through reinforcement learning. It targets researchers and practitioners aiming to improve VLM performance and reduce computational costs, particularly for tasks requiring fine-grained visual understanding and OCR. The core innovation lies in using reinforcement learning to autonomously learn when to reduce visual tokens, leading to significant efficiency gains without sacrificing accuracy.
How It Works
VisionThink employs reinforcement learning to train a VLM to dynamically decide whether to discard visual tokens. This approach allows the model to adapt its processing based on the input, focusing computational resources on relevant visual information. By learning an optimal token reduction strategy, it achieves substantial efficiency improvements, particularly on benchmarks sensitive to visual detail.
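To make the decision-and-reward structure concrete, here is a minimal, hypothetical sketch of the loop described above. It is not the project's implementation: the policy, model, and reward functions are placeholders, and the token-cost weighting is an assumption made purely for illustration.

```python
# Illustrative sketch only: a simplified view of the RL objective described above.
# Names and reward shaping are hypothetical, not taken from the VisionThink codebase.

from dataclasses import dataclass
import random

TOKEN_COST_WEIGHT = 0.1  # assumed trade-off between answer correctness and visual-token usage

@dataclass
class Sample:
    image_tokens_full: int   # number of visual tokens at full resolution
    question: str
    answer: str

def policy_decides_to_reduce(sample: Sample) -> bool:
    """Stand-in for the learned policy: decide whether reduced visual tokens suffice."""
    return random.random() < 0.5  # placeholder; in training this decision comes from the VLM itself

def model_answer(sample: Sample, reduced: bool) -> str:
    """Stand-in for the VLM forward pass with either reduced or full visual tokens."""
    return sample.answer if not reduced or random.random() < 0.8 else "wrong"

def reward(sample: Sample, reduced: bool, prediction: str) -> float:
    """Reward = answer correctness minus a penalty proportional to visual tokens consumed."""
    correct = 1.0 if prediction == sample.answer else 0.0
    tokens_used = sample.image_tokens_full // 4 if reduced else sample.image_tokens_full
    return correct - TOKEN_COST_WEIGHT * tokens_used / sample.image_tokens_full

# One rollout of the decision loop; an RL algorithm (e.g., a policy-gradient method)
# would update the policy to maximize this reward over many samples.
sample = Sample(image_tokens_full=1024, question="What does the sign say?", answer="STOP")
reduced = policy_decides_to_reduce(sample)
pred = model_answer(sample, reduced)
print(reward(sample, reduced, pred))
```

The key point the sketch captures is that the reward couples task accuracy with token cost, so the model is pushed to drop visual tokens only when doing so does not hurt the answer.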
Quick Start & Requirements
Create a conda environment (`conda create -n visionthink python=3.11`), activate it, and install the project in editable mode along with `flash-attn`: `pip3 install -e .`. For the Qwen3 judge model, additionally run `pip install -U tensordict transformers==4.51.0`. The general training data is available as `Senqiao/VisionThink-General-Train` (a Hugging Face dataset). Note that `flash-attn` is required.
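A consolidated setup might look like the following. The repository path is assumed to be dvlab-research/VisionThink, and the `--no-build-isolation` flag is a common requirement when installing `flash-attn`, not something stated in the quick start above.

```bash
# Assumed repository path; adjust if the project is hosted elsewhere.
git clone https://github.com/dvlab-research/VisionThink.git
cd VisionThink

conda create -n visionthink python=3.11
conda activate visionthink

pip3 install -e .
pip3 install flash-attn --no-build-isolation  # flag assumed; commonly needed for flash-attn builds

# Only needed when using the Qwen3 judge model
pip install -U tensordict transformers==4.51.0
```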
Highlighted Details
Maintenance & Community
The project is actively maintained by dvlab-research. It builds upon several other open-source projects, including Verl, EasyR1, and Lmms-Eval.
Licensing & Compatibility
VisionThink is licensed under the Apache License 2.0, which permits commercial use and integration with closed-source projects.
Limitations & Caveats
The project is relatively new, with the paper and repository released in July 2025. While it reports significant improvements, extensive independent benchmarking across diverse tasks would be beneficial. Using GPT as the reward model requires Azure API keys.