VisualThinker-R1-Zero by turningpoint-ai

Research paper replicating visual reasoning "aha moment" on a 2B model

Created 1 year ago
623 stars

Top 53.0% on SourcePulse

Project Summary

This project introduces VisualThinker-R1-Zero, a multimodal reasoning model that demonstrates an emergent "aha moment" and increased response length on a 2B-parameter, non-SFT (no supervised fine-tuning) foundation model. It targets researchers and practitioners interested in what small models can achieve on vision-centric reasoning tasks without supervised fine-tuning, showcasing self-correction and improved reasoning without extensive post-training.

How It Works

VisualThinker-R1-Zero builds upon the Qwen2-VL-2B base model, applying GRPO (Group Relative Policy Optimization) directly, with no SFT stage and no learned reward model. This approach aims to elicit emergent reasoning abilities, including self-reflection and self-correction, from the base model through reinforcement learning on visual reasoning tasks. The advantage lies in achieving advanced reasoning capabilities in a compact model without the typical SFT overhead.
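
The core idea of GRPO is to score each sampled response relative to the other responses in its group, replacing a learned value/reward-model baseline with a group statistic. A minimal sketch of that advantage computation (function name and reward values are illustrative assumptions, not the repo's API):

```python
# Sketch of GRPO's group-relative advantage computation (illustrative only).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and std.

    GRPO samples a group of responses per prompt; each response's
    advantage is how far its reward sits above or below the group
    average, so no separate learned baseline is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled responses to one visual-reasoning prompt,
# scored by a rule-based correctness reward (1 = correct, 0 = wrong).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, which is the signal the policy gradient then reinforces.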

Quick Start & Requirements

  • Install/Run: Execute setup.sh for setup, then run_grpo_SAT.sh for GRPO training or run_sft.sh for SFT training. Evaluation scripts are in src/eval.
  • Prerequisites: Requires 4x 80GB GPUs for GRPO Full Fine-Tuning (AMP). Dataset preparation involves prepare_dataset.sh.
  • Resources: Full fine-tuning requires significant GPU memory.
  • Links: Model checkpoint: huggingface, Findings: notion blog, Repo: GitHub
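
Putting the steps above in order, a typical run might look like the following (a sketch based on the script names listed above; exact paths and flags may differ in the repo):

```shell
# Hypothetical end-to-end sequence; verify script names against the repo.
bash setup.sh             # install dependencies
bash prepare_dataset.sh   # download and format the training data
bash run_grpo_SAT.sh      # GRPO training (expects 4x 80GB GPUs)
# or, for the SFT baseline:
bash run_sft.sh
# evaluation scripts live under src/eval
```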

Highlighted Details

  • First observed emergent "aha moment" and increased response length in visual reasoning on a 2B non-SFT model.
  • Demonstrates self-reflection and error correction capabilities similar to DeepSeek R1.
  • Shows that vision-centric tasks benefit from the improved reasoning elicited by RL training.
  • Evaluated on CVBench for multimodal reasoning performance.

Maintenance & Community

The project is actively maintained by TurningPoint AI, with contributors from UCLA, Penn State, University of Maryland, and Google Research. Contact information is available on their homepage.

Licensing & Compatibility

The README does not explicitly state a license for the repository or model checkpoints. The open release and acknowledgments of open-source resources suggest permissive intent, but users should verify the licensing terms before research or commercial use.

Limitations & Caveats

The primary requirement of 4x 80GB GPUs for fine-tuning presents a significant hardware barrier. While the project highlights emergent abilities, the extent of these capabilities and their robustness across diverse visual reasoning tasks may require further investigation.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
