Video-LLM for video understanding tasks, inspired by the R1 paradigm
This project introduces an open-source implementation of the R1 paradigm for video understanding tasks, targeting researchers and developers working with multimodal AI. It aims to improve video comprehension by leveraging reinforcement learning for training large language models, offering a novel approach to video-LLM development.
How It Works
The project adapts the R1 framework to video-LLMs, specifically Qwen2-VL, using GRPO (a pure reinforcement learning method that requires no labeled reasoning trajectories). The model is trained on (video, query, answer) pairs, optimizing for improved video understanding through reward signals derived from answer accuracy. The training data uses a simplified format designed for GRPO.
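Concretely, the learning signal comes from comparing the answer produced by the model against the reference answer for each sample. The snippet below is a minimal sketch of such an accuracy reward, not the repository's actual implementation; the `Answer:` tag convention, the `extract_answer` helper, and the exact-match criterion are assumptions made for illustration.

```python
import re

def extract_answer(completion: str) -> str:
    # Pull the final answer from a completion that ends with "Answer: ...".
    # The "Answer:" tag is an assumption of this sketch, not the repo's prompt format.
    match = re.search(r"Answer:\s*(.+)", completion, flags=re.IGNORECASE)
    return (match.group(1) if match else completion).strip().lower()

def accuracy_reward(completions, ground_truth):
    # 1.0 for an exact match with the reference answer, 0.0 otherwise.
    # GRPO normalizes these rewards within each group of sampled completions
    # to form advantages, which is why no labeled reasoning trace is needed.
    target = ground_truth.strip().lower()
    return [1.0 if extract_answer(c) == target else 0.0 for c in completions]

# A group of sampled completions for the same (video, query) pair:
print(accuracy_reward(
    ["The clip shows a dog chasing a ball. Answer: a dog", "Answer: a cat"],
    "a dog",
))  # -> [1.0, 0.0]
```

Because GRPO only needs the relative spread of rewards within each sampled group, a simple rule-based check like this can stand in for a learned reward model.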
Quick Start & Requirements
Installation involves creating a conda environment (`conda create -n r1 python=3.10`, `conda activate r1`), installing the dependencies (`pip3 install -e ".[dev]"`, `pip3 install flash_attn --no-build-isolation`), and installing the utilities (`cd qwen-vl-utils; pip install -e .`). Download `LLaVA-Video-large-swift-origin.jsonl` to `data/` and clone the video dataset using `git lfs install` and `git clone https://huggingface.co/datasets/malterei/LLaVA-Video-large-swift`. Requirements include `flash_attn`; 4x A100 (80GB) GPUs are recommended for training, but single-GPU training is supported.
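After downloading, each line of the JSONL file should correspond to one training sample in the simplified (video, query, answer) format described above. The field names are not documented here, so the sketch below only loads the file and prints the keys of the first record as a sanity check; the path `data/LLaVA-Video-large-swift-origin.jsonl` follows the download step, everything else is an assumption.

```python
import json
from pathlib import Path

# Path assumed from the download step above.
jsonl_path = Path("data/LLaVA-Video-large-swift-origin.jsonl")

with jsonl_path.open("r", encoding="utf-8") as f:
    first = json.loads(next(f))    # parse only the first record
    total = 1 + sum(1 for _ in f)  # count the remaining lines

print(f"{total} samples; first record keys: {sorted(first.keys())}")
# Expect fields covering the video reference, the query, and the answer.
```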
Highlighted Details
The project releases a provisional model and a training dataset (open-r1-video-4k).
Maintenance & Community
The project is actively developed with recent updates in February 2025. Community feedback and discussions are welcomed.
Licensing & Compatibility
The repository is available under an unspecified license. The project acknowledges contributions from various open-source communities, including DeepSeek and LLaVA.
Limitations & Caveats
The authors note that their insights are not guaranteed to be correct. The training commands are configured for specific hardware (4x A100 80GB) and may require tuning for other setups. The provisional model and dataset are early releases and remain under active development.