Video-R1 by tulerfeng

Video reasoning in MLLMs via reinforcement learning

Created 5 months ago · 645 stars · Top 52.6% on sourcepulse

Project Summary

Video-R1 addresses the challenge of enhancing video reasoning capabilities in Multimodal Large Language Models (MLLMs) by applying rule-based reinforcement learning in the style of DeepSeek-R1, rather than a learned reward model as in RLHF. It targets researchers and developers working on video understanding and MLLMs, offering a method to improve temporal and spatial reasoning through specialized datasets and training techniques.
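The R1 recipe scores completions with verifiable, rule-based rewards instead of a learned reward model. A minimal sketch of such a reward function, assuming the common `<think>…</think><answer>…</answer>` output template (the exact reward rules Video-R1 uses may differ):

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the output follows the <think>...</think><answer>...</answer> template."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground truth (exact match here)."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def rule_based_reward(completion: str, ground_truth: str) -> float:
    # Total reward: correctness plus adherence to the reasoning format.
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```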

How It Works

Video-R1 introduces T-GRPO, an extension of GRPO that incorporates temporal modeling to explicitly encourage temporal reasoning in MLLMs, inspired by DeepSeek-R1's success in eliciting reasoning abilities via rule-based RL. The project also provides two custom datasets, both mixing image and video data: Video-R1-COT-165k for Supervised Fine-Tuning (SFT), which carries Chain-of-Thought (CoT) rationales generated by Qwen2.5-VL-72B, and Video-R1-260k for RL training. A rough sketch of the temporal contrast behind T-GRPO follows below.
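The intuition: the policy answers the same question both on temporally ordered frames and on shuffled frames; a temporal bonus goes to correct ordered rollouts only when the ordered group outperforms the shuffled one, and advantages are then group-normalized as in GRPO. A toy illustration under these assumptions (not the project's actual implementation):

```python
from statistics import mean, stdev

def t_grpo_advantages(rewards_ordered, rewards_shuffled, bonus=0.1):
    """Toy T-GRPO-style advantages for one group of rollouts.

    rewards_ordered:  rewards for rollouts conditioned on frames in temporal order.
    rewards_shuffled: rewards for rollouts on the same frames, randomly shuffled.
    """
    # Grant the temporal bonus to correct ordered rollouts only when ordering helps.
    if mean(rewards_ordered) > mean(rewards_shuffled):
        rewards_ordered = [r + bonus if r > 0 else r for r in rewards_ordered]
    # GRPO step: normalize each reward against its group's mean and std.
    mu = mean(rewards_ordered)
    sigma = stdev(rewards_ordered) if len(rewards_ordered) > 1 else 1.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards_ordered]

# Example: the ordered group answers correctly more often than the shuffled one.
print(t_grpo_advantages([1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0]))
```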

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n video-r1 python=3.11), activate it, and run bash setup.sh. Install qwen-vl-utils with pip install -e .[decord]. Download the datasets into src/r1-v/Video-R1-data/ and unzip them. Install the pinned dependencies: the provided transformers snapshot (transformers-main.zip), vLLM 0.7.2, and trl 0.16.0.
  • Prerequisites: Python 3.11, Conda, Git LFS, decord, specific versions of transformers, vLLM, and trl. Training requires 4x H20 (96GB) or 5x A100 (80GB) GPUs.
  • Resources: Training involves SFT (1 epoch) followed by RL training (1.2k steps). Inference can use higher per-frame resolution (up to 256x28x28 pixels) and larger frame counts (16/32/64); see the inference sketch after this list.
  • Links: Paper, Video-R1-7B-model, Video-R1-train-data.
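
A minimal inference sketch using the standard Qwen2.5-VL API with the settings above (the model ID, video path, and prompt are placeholders; the repository's own evaluation scripts may configure things differently):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # from the pinned qwen-vl-utils install

MODEL_ID = "Video-R1/Video-R1-7B"  # assumed Hugging Face repo name

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {
            "type": "video",
            "video": "file:///path/to/video.mp4",  # placeholder path
            "max_pixels": 256 * 28 * 28,           # per-frame pixel budget noted above
            "nframes": 32,                         # 16/32/64 per the README
        },
        {"type": "text", "text": "How many people enter the room? Explain your reasoning."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens.
print(processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```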

Highlighted Details

  • Achieves 35.8% accuracy on VSI-Bench (video spatial reasoning), outperforming GPT-4o.
  • Supports Qwen2.5-VL, vLLM for training/inference, and image-video mixed training.
  • Offers a full pipeline including dataset creation, CoT annotation, SFT, RL training, and evaluation.
  • Demonstrates emergent "aha moment" self-reflection reasoning behaviors during RL training.

Maintenance & Community

The project was released on March 28, 2025, with code, model weights, and datasets available on Hugging Face and ModelScope. Links to community resources are not explicitly provided in the README.

Licensing & Compatibility

The project's model weights and datasets are hosted on Hugging Face, but the README does not state a license for the code, weights, or data, and hosting there does not by itself imply permissive terms. Compatibility with commercial use is likewise not stated.

Limitations & Caveats

The RL training was limited to 1.2k steps due to resource constraints. The model was trained with a maximum of 16 frames at 128x28x28 resolution, though inference can utilize more frames and higher resolutions. Compatibility issues with updated Transformers library versions are noted, requiring the use of a specific provided version.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 10
  • Star History: 165 stars in the last 90 days
