Research paper replicating visual reasoning "aha moment" on a 2B model
This project introduces VisualThinker-R1-Zero, a multimodal reasoning model that demonstrates an emergent "aha moment" and increased response length on a 2B-parameter base model with no supervised fine-tuning (SFT). It targets researchers and practitioners interested in what small, non-SFT models can achieve on vision-centric reasoning tasks, showcasing self-correction and improved reasoning without extensive post-training.
How It Works
VisualThinker-R1-Zero builds on the Qwen2-VL-2B base model, training it with GRPO (Group Relative Policy Optimization) on visual reasoning tasks, without SFT and without a learned reward model. The goal is to elicit emergent reasoning abilities, including self-reflection and self-correction, directly from the base model through reinforcement learning. The advantage is achieving advanced reasoning capabilities at a compact model size without the typical SFT overhead.
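For reference, GRPO replaces the learned value model with a group baseline: for each prompt it samples a group of G responses, scores them with a reward, and normalizes each reward against the group statistics. A common form of the group-relative advantage (a sketch of the standard GRPO formulation; the repository may use a variant) is

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\]

which then drives a clipped, PPO-style policy update.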
Quick Start & Requirements
Run setup.sh to set up the environment and prepare_dataset.sh to prepare the training data, then launch run_grpo_SAT.sh for GRPO training or run_sft.sh for SFT training. Evaluation scripts are in src/eval. Fine-tuning requires roughly 4x 80 GB GPUs (see Limitations & Caveats).
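A minimal command sequence, assuming the scripts are run from the repository root and take no required arguments (exact paths and invocations may differ; check each script before running):

```bash
# Set up the environment and prepare the training data
bash setup.sh
bash prepare_dataset.sh

# Train with GRPO (no SFT, no learned reward model) ...
bash run_grpo_SAT.sh

# ... or run the SFT baseline for comparison
# bash run_sft.sh

# Evaluation scripts live under src/eval
ls src/eval
```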
Maintenance & Community
The project is actively maintained by TurningPoint AI, with contributors from UCLA, Penn State, University of Maryland, and Google Research. Contact information is available on their homepage.
Licensing & Compatibility
The README does not explicitly state a license for the repository or the model checkpoints. The open release of code and checkpoints, together with the acknowledgment of other open-source resources, suggests a permissive intent, but users should confirm the license terms before research or commercial use.
Limitations & Caveats
Fine-tuning requires roughly 4x 80 GB GPUs, which is a significant hardware barrier. While the project highlights emergent abilities, their extent and robustness across diverse visual reasoning tasks remain to be investigated.