VisualThinker-R1-Zero by turningpoint-ai

Research paper replicating visual reasoning "aha moment" on a 2B model

created 5 months ago
605 stars

Top 54.9% on sourcepulse

Project Summary

This project introduces VisualThinker-R1-Zero, a multimodal reasoning model that demonstrates an emergent "aha moment" and growing response length on a 2B-parameter base model with no supervised fine-tuning (SFT). It targets researchers and practitioners interested in what small, non-SFT models can achieve on vision-centric reasoning tasks, showcasing self-correction and improved reasoning without extensive post-training.

How It Works

VisualThinker-R1-Zero builds upon the Qwen2-VL-2B base model, applying GRPO (Group Relative Policy Optimization) directly, with no SFT stage and no learned reward model. Reinforcement learning on visual reasoning tasks elicits emergent abilities, including self-reflection and self-correction, straight from the base model. The advantage lies in achieving advanced reasoning capabilities at a compact model size without the typical SFT overhead.
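
For reference, GRPO scores each sampled response against the others in its group instead of against a learned value or reward model. A standard formulation of the per-response advantage, taken from the GRPO literature rather than from this repository's code, is:

$$A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}, \qquad i = 1,\dots,G$$

where r_1, ..., r_G are the rewards of the G responses sampled for the same prompt; the policy is then updated with a PPO-style clipped objective using these advantages.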

Quick Start & Requirements

  • Install/Run: Execute setup.sh for environment setup, then run_grpo_SAT.sh for GRPO training or run_sft.sh for the SFT baseline; see the command sketch after this list. Evaluation scripts live in src/eval.
  • Prerequisites: GRPO full fine-tuning with AMP requires 4x 80GB GPUs. Dataset preparation is handled by prepare_dataset.sh.
  • Resources: Full fine-tuning requires significant GPU memory.
  • Links: Model checkpoint on Hugging Face, findings on the team's Notion blog, code on GitHub.
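
A plausible end-to-end invocation, assuming the scripts are run from the repository root (exact arguments and ordering may differ; the repository's README is authoritative):

```bash
# Hypothetical run order inferred from the script names above.
bash setup.sh              # install dependencies and set up the environment
bash prepare_dataset.sh    # download and prepare the training data
bash run_grpo_SAT.sh       # GRPO training (expects 4x 80GB GPUs for full fine-tuning)
# bash run_sft.sh          # alternatively, train the SFT baseline
```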

Highlighted Details

  • First reported emergent "aha moment" and growing response length in visual reasoning on a 2B non-SFT model.
  • Demonstrates self-reflection and error-correction behavior similar to DeepSeek-R1.
  • Indicates that vision-centric tasks benefit from improved reasoning capabilities.
  • Evaluated on CVBench for multimodal reasoning performance.

Maintenance & Community

The project is actively maintained by TurningPoint AI, with contributors from UCLA, Penn State, University of Maryland, and Google Research. Contact information is available on their homepage.

Licensing & Compatibility

The README does not explicitly state a license for the repository or model checkpoints. The open release and the acknowledgments of open-source resources suggest permissive intent, but users should verify licensing terms before research or commercial use.

Limitations & Caveats

The primary requirement of 4x 80GB GPUs for fine-tuning presents a significant hardware barrier. While the project highlights emergent abilities, the extent of these capabilities and their robustness across diverse visual reasoning tasks may require further investigation.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 25 stars in the last 90 days
