Long-RL by NVlabs

Framework for scaling RL to long video sequences

Created 2 months ago
609 stars

Top 53.9% on SourcePulse

View on GitHub
Project Summary

This repository provides Long-RL, a full-stack framework for scaling Reinforcement Learning (RL) to long video reasoning tasks. It addresses the challenges of processing extended video sequences by integrating a large-scale dataset (LongVideo-Reason), a two-stage training pipeline (CoT-SFT followed by RL), and an efficient training infrastructure (Multi-modal Reinforcement Sequence Parallelism, MR-SP). The framework targets researchers and engineers working with vision-language models (VLMs) and long-form video content.

How It Works

Long-RL employs a two-stage training process: Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) followed by Reinforcement Learning (RL). The core innovation is the MR-SP training infrastructure, which combines sequence parallelism with a vLLM-based rollout engine: long videos are encoded once, and the cached embeddings are reused during prefilling across RL rollouts, significantly speeding up RL training on extended sequences.
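A minimal sketch of these two ideas, with hypothetical function names rather than Long-RL's actual API: the video is encoded once and the embeddings cached, and the long embedding sequence is split along the sequence axis so each GPU rank prefills only its slice.

import torch

def cache_video_embeddings(frames, encoder):
    # Encode all frames once and keep the result; RL rollouts over the same
    # video reuse this cache instead of re-running the vision encoder.
    with torch.no_grad():
        return encoder(frames)

def shard_for_sequence_parallel(embeddings, rank, world_size):
    # Sequence parallelism: split the (seq_len, hidden) tensor along the
    # sequence axis so each rank prefills only its contiguous chunk.
    return torch.chunk(embeddings, world_size, dim=0)[rank]

# Example: ~3,600 frames at 4 visual tokens per frame, sharded over 8 GPUs.
embeddings = torch.randn(3600 * 4, 1024)
local = shard_for_sequence_parallel(embeddings, rank=0, world_size=8)
print(local.shape)  # each of the 8 ranks holds 1/8 of the sequence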

Quick Start & Requirements

  • Installation: git clone https://github.com/NVlabs/Long-RL.git then cd Long-RL and pip install -e .
  • Prerequisites: Python with vLLM (tested on v0.9.1) and Hugging Face Transformers; specific model requirements may vary. GPU memory is a key constraint for long video processing.
  • Demo: A Gradio demo is available at https://long-rl.hanlab.ai.
  • Model Weights: LongVILA-R1-7B weights are available on Hugging Face: https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B (see the download sketch after this list).
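The checkpoint can be fetched programmatically; below is a minimal sketch using huggingface_hub. How the weights are then loaded is model-specific (VILA-family models ship their own inference code), so follow the repository's examples from there.

from huggingface_hub import snapshot_download

# Download the released checkpoint into the local Hugging Face cache and
# return its path; loading the model from here is model-specific.
checkpoint_dir = snapshot_download("Efficient-Large-Model/LongVILA-R1-7B")
print(checkpoint_dir)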

Highlighted Details

  • Supports hour-level long video RL training (3,600 frames) on a single A100 node (8 GPUs) using sequence parallelism.
  • Enables RL training for multi-modal inputs (text, video, audio) and image/video generation models (e.g., Stable Diffusion).
  • Offers open-ended reward support, cached video embeddings for faster training, and chunked gathering to manage memory (sketched after this list).
  • Achieves up to 2.1x speedup in long video RL training with the MR-SP system.
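Chunked gathering can be illustrated with a short sketch. This is an illustration under stated assumptions, not the repository's implementation: it presumes torch.distributed is initialized and every rank holds a same-shaped tensor, and it gathers in fixed-size chunks so only one chunk's worth of communication buffers is live at a time.

import torch
import torch.distributed as dist

def chunked_all_gather(tensor, chunk_size, world_size):
    # Gather a large (N, ...) tensor across ranks chunk by chunk to bound
    # peak memory, instead of one monolithic all_gather.
    per_rank = [[] for _ in range(world_size)]
    for start in range(0, tensor.size(0), chunk_size):
        chunk = tensor[start:start + chunk_size].contiguous()
        out = [torch.empty_like(chunk) for _ in range(world_size)]
        dist.all_gather(out, chunk)
        for r in range(world_size):
            per_rank[r].append(out[r])
    # Reassemble in rank order so the layout matches a single all_gather.
    return torch.cat([torch.cat(chunks, dim=0) for chunks in per_rank], dim=0)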

Maintenance & Community

The project is actively maintained, with recent updates in July 2025. Key contributors include Yukang Chen, Wei Huang, and Song Han. The project builds upon EasyR1 and verl frameworks.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, it acknowledges dependencies on EasyR1, verl, vLLM, and Flow-GRPO, whose licenses should be reviewed for compatibility with commercial or closed-source use.

Limitations & Caveats

The framework is demonstrated primarily with VILA and Qwen series models, though it supports RL training across other modalities and model families. Hardware configuration, particularly GPU memory, is critical for processing long video sequences effectively. The README does not specify a required Python version.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 7
Star History
25 stars in the last 30 days

Explore Similar Projects

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel

Multimodal autoregressive model for long-context video/text

Created 1 year ago
Updated 11 months ago
7k stars

Top 0.1% on SourcePulse