Research paper on improving MLLM reasoning via reinforcement learning
Top 69.6% on SourcePulse
R1-VL introduces Step-wise Group Relative Policy Optimization (StepGRPO), a reinforcement learning framework for improving the reasoning capabilities of multimodal large language models (MLLMs). It targets researchers and practitioners who want to move beyond imitation learning for MLLMs, enabling models to self-improve their reasoning through step-wise rewards.
How It Works
StepGRPO employs a reinforcement learning approach with two custom rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR uses soft key-step matching to reward accurate intermediate reasoning steps, while StepRVR evaluates the logical consistency and completeness of the reasoning process. This method allows models to learn from both correct and incorrect reasoning paths, fostering a deeper understanding of reasoning processes.
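To make the reward design concrete, below is a minimal, illustrative Python sketch of the two step-wise rewards. It is not the repository's implementation: the soft key-step matching is approximated with simple token overlap, and the <think>/<answer> output format, the match threshold, and the 0.5 weighting are assumptions for illustration only.

```python
def step_rar(reasoning: str, key_steps: list[str], match_threshold: float = 0.6) -> float:
    """Step-wise Reasoning Accuracy Reward (StepRAR), sketched.

    Gives partial credit for each pre-extracted key step that approximately
    appears in the generated reasoning path (token-overlap stand-in for the
    paper's soft key-step matching).
    """
    def soft_match(step: str, text: str) -> bool:
        step_tokens = set(step.lower().split())
        text_tokens = set(text.lower().split())
        if not step_tokens:
            return False
        return len(step_tokens & text_tokens) / len(step_tokens) >= match_threshold

    matched = sum(soft_match(s, reasoning) for s in key_steps)
    return matched / max(len(key_steps), 1)


def step_rvr(reasoning: str) -> float:
    """Step-wise Reasoning Validity Reward (StepRVR), sketched.

    Rewards outputs that present a complete reasoning block before the answer,
    as a stand-in for the paper's logical-completeness criterion. The tag
    format is an assumption.
    """
    has_reasoning = "<think>" in reasoning and "</think>" in reasoning
    has_answer = "<answer>" in reasoning and "</answer>" in reasoning
    ordered = (has_reasoning and has_answer
               and reasoning.index("</think>") < reasoning.index("<answer>"))
    return 1.0 if ordered else 0.0


def step_grpo_reward(reasoning: str, key_steps: list[str], alpha: float = 0.5) -> float:
    """Combine the two step-wise rewards; the equal weighting is an assumption."""
    return alpha * step_rar(reasoning, key_steps) + (1 - alpha) * step_rvr(reasoning)


if __name__ == "__main__":
    sample = "<think>Count the apples: 3 + 4 = 7.</think><answer>7</answer>"
    print(step_grpo_reward(sample, key_steps=["3 + 4 = 7"]))  # prints 1.0
```

Because both rewards score intermediate steps rather than only the final answer, a group of sampled reasoning paths can be compared step by step, which is what lets the policy learn from partially correct trajectories as described above.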
Quick Start & Requirements
1. SFT stage: Update dataset_info.json with the provided R1-VL SFT data configuration, then train with llamafactory-cli train examples/train_full/qwen2vl_2b_full_sft.yaml.
2. StepGRPO stage: Run setup.sh, then edit run_grpo_2b_vllm.sh and run_grpo_7b.sh to point at your SFT model and dataset paths. Launch training with bash src/r1-vl/run_grpo_2b_vllm.sh or bash src/r1-vl/run_grpo_7b.sh.
3. Evaluation: python run.py --data MathVista_MINI --model R1-VL-7B --verbose.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats