Research paper replicating visual reasoning "aha moment" on a 2B model
This project introduces VisualThinker-R1-Zero, a multimodal reasoning model that demonstrates an emergent "aha moment" and increased response length on a 2B-parameter base model with no supervised fine-tuning (SFT). It targets researchers and practitioners interested in what small, non-SFT models can achieve on vision-centric reasoning tasks, showcasing self-correction and improved reasoning without extensive post-training.
How It Works
VisualThinker-R1-Zero builds on the Qwen2-VL-2B base model, training it with GRPO (Group Relative Policy Optimization) on visual reasoning tasks, without SFT and without a learned reward model. The goal is to elicit emergent reasoning abilities, including self-reflection and self-correction, directly from the base model through reinforcement learning. The advantage is achieving advanced reasoning capabilities at a compact model size without the typical SFT overhead.
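For reference, GRPO replaces the learned value model with a group baseline: for each prompt it samples a group of G responses, scores them with a reward, and normalizes each reward against the group statistics. A common form of the group-relative advantage (a sketch of the standard GRPO formulation; the repository may use a variant) is

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},
\]

which then drives a clipped, PPO-style policy update.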
Quick Start & Requirements
Run setup.sh to set up the environment and prepare_dataset.sh to prepare the training data, then launch run_grpo_SAT.sh for GRPO training or run_sft.sh for SFT training. Evaluation scripts are in src/eval. Fine-tuning requires roughly 4x 80 GB GPUs (see Limitations & Caveats).
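A minimal command sequence, assuming the scripts are run from the repository root and take no required arguments (exact paths and invocations may differ; check each script before running):

```bash
# Set up the environment and prepare the training data
bash setup.sh
bash prepare_dataset.sh

# Train with GRPO (no SFT, no learned reward model) ...
bash run_grpo_SAT.sh

# ... or run the SFT baseline for comparison
# bash run_sft.sh

# Evaluation scripts live under src/eval
ls src/eval
```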
Maintenance & Community
The project is actively maintained by TurningPoint AI, with contributors from UCLA, Penn State, University of Maryland, and Google Research. Contact information is available on their homepage.
Licensing & Compatibility
The README does not explicitly state a license for the repository or the model checkpoints. The open release of code and checkpoints, together with the acknowledgment of other open-source resources, suggests a permissive intent, but users should confirm the license terms before research or commercial use.
Limitations & Caveats
Fine-tuning requires roughly 4x 80 GB GPUs, which is a significant hardware barrier. While the project highlights emergent abilities, their extent and robustness across diverse visual reasoning tasks remain to be investigated.