Research paper on improving MLLM reasoning via reinforcement learning
Top 69.6% on SourcePulse
R1-VL introduces Step-wise Group Relative Policy Optimization (StepGRPO), a reinforcement learning framework for improving the reasoning capabilities of multimodal large language models (MLLMs). It targets researchers and practitioners who want to move beyond imitation learning for MLLMs, enabling models to self-improve their reasoning through step-wise rewards.
How It Works
StepGRPO employs a reinforcement learning approach with two custom rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR uses soft key-step matching to reward accurate intermediate reasoning steps, while StepRVR evaluates the logical consistency and completeness of the reasoning process. This method allows models to learn from both correct and incorrect reasoning paths, fostering a deeper understanding of reasoning processes.
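To make the reward design concrete, below is a minimal, illustrative Python sketch of the two step-wise rewards. It is not the repository's implementation: the soft key-step matching is approximated with simple token overlap, and the <think>/<answer> output format, the match threshold, and the 0.5 weighting are assumptions for illustration only.

```python
def step_rar(reasoning: str, key_steps: list[str], match_threshold: float = 0.6) -> float:
    """Step-wise Reasoning Accuracy Reward (StepRAR), sketched.

    Gives partial credit for each pre-extracted key step that approximately
    appears in the generated reasoning path (token-overlap stand-in for the
    paper's soft key-step matching).
    """
    def soft_match(step: str, text: str) -> bool:
        step_tokens = set(step.lower().split())
        text_tokens = set(text.lower().split())
        if not step_tokens:
            return False
        return len(step_tokens & text_tokens) / len(step_tokens) >= match_threshold

    matched = sum(soft_match(s, reasoning) for s in key_steps)
    return matched / max(len(key_steps), 1)


def step_rvr(reasoning: str) -> float:
    """Step-wise Reasoning Validity Reward (StepRVR), sketched.

    Rewards outputs that present a complete reasoning block before the answer,
    as a stand-in for the paper's logical-completeness criterion. The tag
    format is an assumption.
    """
    has_reasoning = "<think>" in reasoning and "</think>" in reasoning
    has_answer = "<answer>" in reasoning and "</answer>" in reasoning
    ordered = (has_reasoning and has_answer
               and reasoning.index("</think>") < reasoning.index("<answer>"))
    return 1.0 if ordered else 0.0


def step_grpo_reward(reasoning: str, key_steps: list[str], alpha: float = 0.5) -> float:
    """Combine the two step-wise rewards; the equal weighting is an assumption."""
    return alpha * step_rar(reasoning, key_steps) + (1 - alpha) * step_rvr(reasoning)


if __name__ == "__main__":
    sample = "<think>Count the apples: 3 + 4 = 7.</think><answer>7</answer>"
    print(step_grpo_reward(sample, key_steps=["3 + 4 = 7"]))  # prints 1.0
```

Because both rewards score intermediate steps rather than only the final answer, a group of sampled reasoning paths can be compared step by step, which is what lets the policy learn from partially correct trajectories as described above.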
Quick Start & Requirements
1. SFT stage: Update dataset_info.json with the provided R1-VL SFT data configuration, then train with llamafactory-cli train examples/train_full/qwen2vl_2b_full_sft.yaml.
2. StepGRPO stage: Run setup.sh, then edit run_grpo_2b_vllm.sh and run_grpo_7b.sh to point at your SFT model and dataset paths. Launch training with bash src/r1-vl/run_grpo_2b_vllm.sh or bash src/r1-vl/run_grpo_7b.sh.
3. Evaluation: python run.py --data MathVista_MINI --model R1-VL-7B --verbose.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats