Long-RL by NVlabs

Framework for scaling RL to long video sequences

Created 2 months ago
609 stars

Top 53.9% on SourcePulse

View on GitHub
Project Summary

This repository provides Long-RL, a full-stack framework for scaling Reinforcement Learning (RL) to long video reasoning tasks. It addresses the challenges of processing extended video sequences by integrating a large-scale dataset (LongVideo-Reason), a two-stage training pipeline (CoT-SFT followed by RL), and an efficient training infrastructure (Multi-modal Reinforcement Sequence Parallelism, MR-SP). The framework targets researchers and engineers working with vision-language models (VLMs) and long-form video content.

How It Works

Long-RL employs a two-stage training process: Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) followed by Reinforcement Learning (RL). The core innovation is the MR-SP training infrastructure, which combines sequence parallelism with a vLLM-based rollout engine: long videos are encoded once, and the cached embeddings are reused during prefilling across RL rollouts, significantly speeding up RL training on extended sequences.
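A minimal sketch of these two ideas, with hypothetical function names rather than Long-RL's actual API: the video is encoded once and the embeddings cached, and the long embedding sequence is split along the sequence axis so each GPU rank prefills only its slice.

import torch

def cache_video_embeddings(frames, encoder):
    # Encode all frames once and keep the result; RL rollouts over the same
    # video reuse this cache instead of re-running the vision encoder.
    with torch.no_grad():
        return encoder(frames)

def shard_for_sequence_parallel(embeddings, rank, world_size):
    # Sequence parallelism: split the (seq_len, hidden) tensor along the
    # sequence axis so each rank prefills only its contiguous chunk.
    return torch.chunk(embeddings, world_size, dim=0)[rank]

# Example: ~3,600 frames at 4 visual tokens per frame, sharded over 8 GPUs.
embeddings = torch.randn(3600 * 4, 1024)
local = shard_for_sequence_parallel(embeddings, rank=0, world_size=8)
print(local.shape)  # each of the 8 ranks holds 1/8 of the sequence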

Quick Start & Requirements

  • Installation: git clone https://github.com/NVlabs/Long-RL.git then cd Long-RL and pip install -e .
  • Prerequisites: Python with vLLM (tested on v0.9.1) and Hugging Face Transformers; specific model requirements may vary. GPU memory is a key constraint for long video processing.
  • Demo: A Gradio demo is available at https://long-rl.hanlab.ai.
  • Model Weights: LongVILA-R1-7B weights are available on Hugging Face: https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B (see the download sketch after this list).
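The checkpoint can be fetched programmatically; below is a minimal sketch using huggingface_hub. How the weights are then loaded is model-specific (VILA-family models ship their own inference code), so follow the repository's examples from there.

from huggingface_hub import snapshot_download

# Download the released checkpoint into the local Hugging Face cache and
# return its path; loading the model from here is model-specific.
checkpoint_dir = snapshot_download("Efficient-Large-Model/LongVILA-R1-7B")
print(checkpoint_dir)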

Highlighted Details

  • Supports hour-level long video RL training (3,600 frames) on a single A100 node (8 GPUs) using sequence parallelism.
  • Enables RL training for multi-modal inputs (text, video, audio) and image/video generation models (e.g., Stable Diffusion).
  • Offers open-ended reward support, cached video embeddings for faster training, and chunked gathering to manage memory (sketched after this list).
  • Achieves up to 2.1x speedup in long video RL training with the MR-SP system.
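Chunked gathering can be illustrated with a short sketch. This is an illustration under stated assumptions, not the repository's implementation: it presumes torch.distributed is initialized and every rank holds a same-shaped tensor, and it gathers in fixed-size chunks so only one chunk's worth of communication buffers is live at a time.

import torch
import torch.distributed as dist

def chunked_all_gather(tensor, chunk_size, world_size):
    # Gather a large (N, ...) tensor across ranks chunk by chunk to bound
    # peak memory, instead of one monolithic all_gather.
    per_rank = [[] for _ in range(world_size)]
    for start in range(0, tensor.size(0), chunk_size):
        chunk = tensor[start:start + chunk_size].contiguous()
        out = [torch.empty_like(chunk) for _ in range(world_size)]
        dist.all_gather(out, chunk)
        for r in range(world_size):
            per_rank[r].append(out[r])
    # Reassemble in rank order so the layout matches a single all_gather.
    return torch.cat([torch.cat(chunks, dim=0) for chunks in per_rank], dim=0)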

Maintenance & Community

The project is actively maintained, with recent updates in July 2025. Key contributors include Yukang Chen, Wei Huang, and Song Han. The project builds upon EasyR1 and verl frameworks.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, it acknowledges dependencies on EasyR1, verl, vLLM, and Flow-GRPO, whose licenses should be reviewed for compatibility with commercial or closed-source use.

Limitations & Caveats

The framework is demonstrated primarily with VILA and Qwen series models, though it supports RL training across other modalities and model families. Hardware configuration, particularly GPU memory, is critical for processing long video sequences effectively. The README does not specify a required Python version.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 7
Star History
25 stars in the last 30 days

Explore Similar Projects

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel

Multimodal autoregressive model for long-context video/text

Created 1 year ago
Updated 11 months ago
7k stars

Top 0.1% on SourcePulse