R1-VL by jingyi0000

Research paper and code for MLLM reasoning via reinforcement learning

Created 6 months ago · 423 stars · Top 69.6% on SourcePulse

Project Summary

R1-VL introduces a novel reinforcement learning framework, Step-wise Group Relative Policy Optimization (StepGRPO), to enhance multimodal large language models' (MLLMs) reasoning capabilities. It targets researchers and practitioners aiming to move beyond imitation learning for MLLMs, enabling models to self-improve their reasoning through dense, step-wise rewards.

How It Works

StepGRPO employs a reinforcement learning approach with two custom rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR uses soft key-step matching to reward accurate intermediate reasoning steps, while StepRVR evaluates the logical consistency and completeness of the reasoning process. This method allows models to learn from both correct and incorrect reasoning paths, fostering a deeper understanding of reasoning processes.
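As a rough illustration of the reward design described above, the sketch below combines a soft key-step matching reward with a structural validity check. This is not the repository's implementation: the think/answer tag format, the difflib-based similarity matching, the 0.7 threshold, and the equal weighting are all illustrative assumptions.

```python
# Illustrative sketch of StepGRPO-style step-wise rewards (not the repo's actual code).
import re
from difflib import SequenceMatcher

def step_rar(response: str, key_steps: list[str], match_threshold: float = 0.7) -> float:
    """Step-wise Reasoning Accuracy Reward (StepRAR): soft-match pre-extracted key
    steps against the generated reasoning and reward the fraction that appears."""
    matched = 0
    for step in key_steps:
        best = max(
            SequenceMatcher(None, step.lower(), line.lower()).ratio()
            for line in (response.splitlines() or [""])
        )
        if best >= match_threshold:
            matched += 1
    return matched / max(len(key_steps), 1)

def step_rvr(response: str) -> float:
    """Step-wise Reasoning Validity Reward (StepRVR): check that the output is
    complete (reasoning then a final answer) and logically ordered."""
    has_reasoning = bool(re.search(r"<think>.*</think>", response, re.S))
    has_answer = bool(re.search(r"<answer>.*</answer>", response, re.S))
    ordered = (
        response.find("</think>") < response.find("<answer>")
        if has_reasoning and has_answer
        else False
    )
    return 1.0 if ordered else 0.0

def step_wise_reward(response: str, key_steps: list[str], alpha: float = 0.5) -> float:
    """Combine the two dense rewards; group-relative advantages would then be
    computed over a sampled group of responses, as in GRPO."""
    return alpha * step_rar(response, key_steps) + (1 - alpha) * step_rvr(response)
```

In the actual method, such per-response rewards would be normalized within each sampled group to form the group-relative advantages used for the policy update.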

Quick Start & Requirements

  • SFT Warm-up: Requires LLaMA-Factory; install it following the official instructions. Update dataset_info.json with the provided R1-VL SFT data configuration, then train with llamafactory-cli train examples/train_full/qwen2vl_2b_full_sft.yaml.
  • RL with StepGRPO: Set up the environment with setup.sh. Edit run_grpo_2b_vllm.sh and run_grpo_7b.sh to point at your SFT model and dataset paths, then launch training with bash src/r1-vl/run_grpo_2b_vllm.sh or bash src/r1-vl/run_grpo_7b.sh.
  • Evaluation: Install VLMEvalKit and replace the necessary files with the provided ones, then run python run.py --data MathVista_MINI --model R1-VL-7B --verbose (a consolidated command sketch follows this list).
  • Hardware: Experiments were conducted on 4 H100-80GB GPUs.
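Taken together, the three stages above look roughly like this. The commands are those listed in the README; the way setup.sh is invoked and the checkpoint/dataset paths inside the scripts are assumptions you will need to adapt.

```bash
# 1) SFT warm-up with LLaMA-Factory (after registering the R1-VL SFT data in dataset_info.json)
llamafactory-cli train examples/train_full/qwen2vl_2b_full_sft.yaml

# 2) RL with StepGRPO (edit the scripts first to point at your SFT model and dataset paths)
bash setup.sh
bash src/r1-vl/run_grpo_2b_vllm.sh   # 2B model; use run_grpo_7b.sh for the 7B model

# 3) Evaluation with VLMEvalKit (after replacing the necessary files with the provided ones)
python run.py --data MathVista_MINI --model R1-VL-7B --verbose
```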

Highlighted Details

  • Introduces StepGRPO, a novel online RL framework for MLLM reasoning.
  • Features StepRAR and StepRVR for dense, step-wise reasoning rewards.
  • Models R1-VL-7B and R1-VL-2B released.
  • Built upon Qwen2-VL models.

Maintenance & Community

  • Code released for RL stage (April 30, 2025) and SFT warm-up stage (April 16, 2025).
  • Models released March 22, 2025. Paper on arXiv March 17, 2025.
  • Built on R1-V, LLaMA-Factory, and VLMEvalKit codebases.

Licensing & Compatibility

  • License not explicitly stated in the README.
  • Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • RL training data preparation is marked as "Coming soon."
  • The README does not specify the license, which may impact commercial adoption.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

  • 5 stars in the last 30 days

Explore Similar Projects

  • simpleRL-reason by hkust-nlp: RL recipe for reasoning ability in models. 4k stars; Top 0.1% on SourcePulse. Created 7 months ago, updated 1 month ago. Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.