Video reasoning in MLLMs via reinforcement learning
Video-R1 addresses the challenge of enhancing video reasoning capabilities in Multimodal Large Language Models (MLLMs) by applying R1-style rule-based reinforcement learning rather than reward models learned from human feedback. It targets researchers and developers working on video understanding and MLLMs, offering a method to improve temporal and spatial reasoning through specialized datasets and training techniques.
How It Works
Video-R1 introduces T-GRPO, an extension of GRPO that adds temporal modeling to explicitly encourage temporal reasoning in MLLMs, inspired by DeepSeek-R1's success in eliciting reasoning abilities via rule-based RL. The project also provides two custom datasets, both mixing image and video data: Video-R1-COT-165k for Supervised Fine-Tuning (SFT) and Video-R1-260k for RL training. The SFT set includes Chain-of-Thought (CoT) rationales generated by Qwen2.5-VL-72B.
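The temporal contrast at the heart of T-GRPO can be sketched in a few lines: roll out answer groups on temporally ordered and shuffled frames, and pay a bonus to correct ordered-frame responses only when ordering actually helps. The sketch below is illustrative only; the `generate` and `is_correct` callables and the bonus weight are hypothetical placeholders, not the project's trainer code.

```python
import random
from typing import Callable, List, Sequence

def t_grpo_advantages(
    frames: Sequence,             # decoded video frames in temporal order
    question: str,
    generate: Callable,           # hypothetical rollout fn: (frames, question, n) -> n answers
    is_correct: Callable,         # rule-based verifier: answer -> bool
    group_size: int = 8,
    temporal_bonus: float = 0.3,  # illustrative weight, not the paper's value
) -> List[float]:
    """Contrast rollouts on ordered vs. shuffled frames; if temporal order
    helps, reward correct ordered responses extra, then normalize per group
    as in GRPO: advantage = (reward - group mean) / group std."""
    shuffled = random.sample(list(frames), len(frames))
    ordered_answers = generate(frames, question, group_size)
    shuffled_answers = generate(shuffled, question, group_size)

    acc_ordered = sum(map(is_correct, ordered_answers)) / group_size
    acc_shuffled = sum(map(is_correct, shuffled_answers)) / group_size

    # Rule-based accuracy reward for the ordered-frame group.
    rewards = [1.0 if is_correct(a) else 0.0 for a in ordered_answers]

    # Temporal incentive: only grant the bonus when ordered frames beat shuffled ones,
    # so the model cannot score well while ignoring temporal structure.
    if acc_ordered > acc_shuffled:
        rewards = [r + temporal_bonus if r > 0 else r for r in rewards]

    mean = sum(rewards) / group_size
    std = (sum((r - mean) ** 2 for r in rewards) / group_size) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]
```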
Quick Start & Requirements
Create a conda environment, activate it, and run the setup script:

```bash
conda create -n video-r1 python=3.11
conda activate video-r1
bash setup.sh
```

Install qwen-vl-utils with video-decoding support:

```bash
pip install -e .[decord]
```

Download the datasets, place them in src/r1-v/Video-R1-data/, and unzip them. Finally, install the pinned dependencies: the transformers build provided with the repo (transformers-main.zip), vLLM 0.7.2, and trl 0.16.0; decord handles video decoding. Training requires 4x H20 (96GB) or 5x A100 (80GB) GPUs.
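Since the released checkpoints build on Qwen2.5-VL, inference can presumably follow the standard Qwen2.5-VL pattern via qwen-vl-utils. The sketch below is a best-effort illustration under that assumption; the model id, video path, and prompt are placeholders not confirmed by the README, and it assumes the pinned transformers build is installed.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # installed via pip install -e .[decord]

MODEL_ID = "Video-R1/Video-R1-7B"  # assumed Hugging Face id; check the model card

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        # max_pixels mirrors the 128x28x28 per-frame pixel budget used in training
        {"type": "video", "video": "file:///path/to/video.mp4",
         "max_pixels": 128 * 28 * 28, "nframes": 16},
        {"type": "text", "text": "What happens after the ball is thrown?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```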
Maintenance & Community
The project was released on March 28, 2025, with code, model weights, and datasets available on Hugging Face and ModelScope. Links to community resources are not explicitly provided in the README.
Licensing & Compatibility
The model weights and datasets are distributed via Hugging Face and ModelScope, but the README does not specify a license, so terms of use, redistribution, and compatibility with commercial use are unclear.
Limitations & Caveats
RL training was limited to 1.2k steps due to resource constraints. The model was trained with at most 16 frames and a per-frame pixel budget of 128x28x28, though inference can use more frames and higher resolutions. Newer versions of the Transformers library are incompatible; the specific version provided with the repo must be used.
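For example, qwen-vl-utils exposes per-video knobs for frame count and pixel budget, so inference can exceed the training settings; the keys below are qwen-vl-utils options, while the values are illustrative:

```python
# Hypothetical message entry raising inference limits past the training config.
video_message = {
    "type": "video",
    "video": "file:///path/to/video.mp4",
    "nframes": 32,                 # more than the 16 frames used during training
    "max_pixels": 256 * 28 * 28,   # larger per-frame pixel budget than 128x28x28
}
```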