Open-R1-Video by Wang-Xiaodong1899

Video-LLM for video understanding tasks, inspired by the R1 paradigm

created 5 months ago
352 stars

Top 80.3% on sourcepulse

View on GitHub
Project Summary

This project introduces an open-source implementation of the R1 paradigm for video understanding tasks, targeting researchers and developers working with multimodal AI. It aims to improve video comprehension by leveraging reinforcement learning for training large language models, offering a novel approach to video-LLM development.

How It Works

The project adapts the R1 framework to video-LLMs, specifically Qwen2-VL, using GRPO (Group Relative Policy Optimization), a pure reinforcement-learning method that requires no labeled reasoning trajectories. The model is trained on (video, query, answer) triples, with reward signals derived from answer accuracy driving improvements in video understanding. The training data uses a simplified format designed for GRPO.
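To make the reward signal concrete, here is a minimal illustrative sketch of a GRPO-style accuracy reward, assuming the model is prompted to wrap its final answer in <answer> tags; this is a sketch under those assumptions, not the repository's exact implementation.

```python
# Illustrative sketch (not the repository's exact code): a GRPO-style accuracy
# reward that scores each sampled completion against the reference answer
# for a (video, query, answer) example.
import re

def accuracy_reward(completions, answers):
    """Return 1.0 when the extracted final answer matches the reference, else 0.0."""
    rewards = []
    for completion, answer in zip(completions, answers):
        # Assumption: the prompt asks the model to wrap its final answer in
        # <answer> tags; fall back to the raw completion if the tags are missing.
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        prediction = (match.group(1) if match else completion).strip().lower()
        rewards.append(1.0 if prediction == answer.strip().lower() else 0.0)
    return rewards

# Example: two sampled completions for the same query, one correct.
print(accuracy_reward(["<answer>B</answer>", "<answer>C</answer>"], ["B", "B"]))  # [1.0, 0.0]
```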

Quick Start & Requirements

  • Install: Clone the repository, create and activate a conda environment (conda create -n r1 python=3.10, conda activate r1), install dependencies (pip3 install -e ".[dev]", pip3 install flash_attn --no-build-isolation), and install utilities (cd qwen-vl-utils; pip install -e .).
  • Data: Download LLaVA-Video-large-swift-origin.jsonl into data/, then fetch the video data with git lfs install followed by git clone https://huggingface.co/datasets/malterei/LLaVA-Video-large-swift (a quick sanity check on the annotation file is sketched after this list).
  • Prerequisites: Python 3.10, CUDA (implied by flash_attn), 4x A100 (80GB) GPUs recommended for training, but single GPU training is supported.
  • Resources: Training requires significant GPU resources. Inference scripts are provided.
  • Links: Models, Datasets, Wandb Logs.
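After downloading, the snippet below is a small, hypothetical sanity check that the annotation file parses as JSON Lines; it only prints whatever keys the first record actually contains, so nothing about the schema is assumed.

```python
# Hypothetical sanity check: confirm the annotation file in data/ parses as
# JSON Lines and inspect the keys of the first record.
import json

path = "data/LLaVA-Video-large-swift-origin.jsonl"
with open(path, "r", encoding="utf-8") as f:
    first = json.loads(next(f))          # parse the first record
    count = 1 + sum(1 for _ in f)        # count the remaining lines

print(f"{count} records; first record keys: {sorted(first.keys())}")
```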

Highlighted Details

  • Introduces R1 paradigm to video-LLMs (e.g., Qwen2-VL).
  • Open-sources training code and a simplified video dataset (open-r1-video-4k).
  • Utilizes GRPO for training, achieving a reported 7.1-point improvement on LongVideoBench for the Open-R1-Video-7B model over a non-reasoning baseline.
  • Provides inference scripts and evaluation benchmarks using lmms-eval (a hedged inference sketch follows this list).
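For orientation, the following is a minimal inference sketch using the standard Qwen2-VL API from transformers together with qwen-vl-utils; the checkpoint ID and video path are placeholders, not confirmed paths, and this is not the repository's own inference script.

```python
# Hedged sketch of video inference via the standard Qwen2-VL transformers API.
# The checkpoint ID and video path below are placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "path/or/hub-id/of/Open-R1-Video-7B"  # placeholder checkpoint ID
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "data/example.mp4"},  # placeholder local video
        {"type": "text", "text": "What happens in this video? Answer briefly."},
    ],
}]

# Build the chat prompt and extract the video frames the processor expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```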

Maintenance & Community

The project is actively developed, with the most recent updates in February 2025. Community feedback and discussion are welcome.

Licensing & Compatibility

The repository is available under an unspecified license. The project acknowledges contributions from various open-source communities, including DeepSeek and LLaVA.

Limitations & Caveats

The authors caution that their insights are not guaranteed to be correct. The training commands are configured for specific hardware (4x A100 80GB) and may require tuning for other setups. The released model and dataset are provisional and remain under active development.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 24 stars in the last 90 days
