Video-LLM for video understanding tasks, inspired by the R1 paradigm
This project introduces an open-source implementation of the R1 paradigm for video understanding tasks, targeting researchers and developers working with multimodal AI. It aims to improve video comprehension by leveraging reinforcement learning for training large language models, offering a novel approach to video-LLM development.
How It Works
The project adapts the R1 framework to video-LLMs, specifically Qwen2-VL, using GRPO (a pure reinforcement learning method that requires no labeled reasoning trajectories). The model is trained on (video, query, answer) pairs, optimizing for improved video understanding through reward signals derived from answer accuracy. The training data uses a simplified format designed for GRPO.
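Concretely, the learning signal comes from comparing the answer produced by the model against the reference answer for each sample. The snippet below is a minimal sketch of such an accuracy reward, not the repository's actual implementation; the `Answer:` tag convention, the `extract_answer` helper, and the exact-match criterion are assumptions made for illustration.

```python
import re

def extract_answer(completion: str) -> str:
    # Pull the final answer from a completion that ends with "Answer: ...".
    # The "Answer:" tag is an assumption of this sketch, not the repo's prompt format.
    match = re.search(r"Answer:\s*(.+)", completion, flags=re.IGNORECASE)
    return (match.group(1) if match else completion).strip().lower()

def accuracy_reward(completions, ground_truth):
    # 1.0 for an exact match with the reference answer, 0.0 otherwise.
    # GRPO normalizes these rewards within each group of sampled completions
    # to form advantages, which is why no labeled reasoning trace is needed.
    target = ground_truth.strip().lower()
    return [1.0 if extract_answer(c) == target else 0.0 for c in completions]

# A group of sampled completions for the same (video, query) pair:
print(accuracy_reward(
    ["The clip shows a dog chasing a ball. Answer: a dog", "Answer: a cat"],
    "a dog",
))  # -> [1.0, 0.0]
```

Because GRPO only needs the relative spread of rewards within each sampled group, a simple rule-based check like this can stand in for a learned reward model.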
Quick Start & Requirements
Installation involves creating a conda environment (`conda create -n r1 python=3.10`, `conda activate r1`), installing the dependencies (`pip3 install -e ".[dev]"`, `pip3 install flash_attn --no-build-isolation`), and installing the utilities (`cd qwen-vl-utils; pip install -e .`). Download `LLaVA-Video-large-swift-origin.jsonl` to `data/` and clone the video dataset using `git lfs install` and `git clone https://huggingface.co/datasets/malterei/LLaVA-Video-large-swift`. Requirements include `flash_attn`; 4x A100 (80GB) GPUs are recommended for training, but single-GPU training is supported.
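After downloading, each line of the JSONL file should correspond to one training sample in the simplified (video, query, answer) format described above. The field names are not documented here, so the sketch below only loads the file and prints the keys of the first record as a sanity check; the path `data/LLaVA-Video-large-swift-origin.jsonl` follows the download step, everything else is an assumption.

```python
import json
from pathlib import Path

# Path assumed from the download step above.
jsonl_path = Path("data/LLaVA-Video-large-swift-origin.jsonl")

with jsonl_path.open("r", encoding="utf-8") as f:
    first = json.loads(next(f))    # parse only the first record
    total = 1 + sum(1 for _ in f)  # count the remaining lines

print(f"{total} samples; first record keys: {sorted(first.keys())}")
# Expect fields covering the video reference, the query, and the answer.
```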
Highlighted Details
The project releases a provisional model and a training dataset (open-r1-video-4k).
Maintenance & Community
The project is actively developed with recent updates in February 2025. Community feedback and discussions are welcomed.
Licensing & Compatibility
The repository is available under an unspecified license. The project acknowledges contributions from various open-source communities, including DeepSeek and LLaVA.
Limitations & Caveats
The authors note that their insights are not guaranteed to be correct. The training commands are configured for specific hardware (4x A100 80GB) and may require tuning for other setups. The provisional model and dataset are early releases and remain under active development.