Open-source RL system for large-scale LLM training
DAPO is an open-source reinforcement learning system designed for large-scale LLM training, developed by ByteDance Seed and Tsinghua AIR. It provides a complete solution including algorithms, code infrastructure, and datasets, aiming to democratize access to advanced RL techniques for the research community.
How It Works
DAPO centers on a novel algorithm, Decoupled Clip and Dynamic Sampling Policy Optimization, which improves the stability and performance of RL for LLMs by decoupling the lower and upper PPO clip ranges and by dynamically resampling prompts whose rollouts carry no gradient signal (all answers correct or all incorrect). Training is supervised through key metrics: generation length and reward score should remain stable, while entropy and mean token probability should follow a controlled trend. Monitoring these signals balances exploration and exploitation, preventing overfitting and promoting consistent performance gains.
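As a rough illustration of the decoupled-clip idea, below is a minimal token-level loss with separate lower and upper clip ranges. This is a sketch, not the repo's actual API: the tensor names and shapes are invented, and the eps_low=0.2 / eps_high=0.28 defaults follow the values reported in the DAPO paper.

    import torch

    def dapo_policy_loss(logp_new, logp_old, advantages, response_mask,
                         eps_low=0.2, eps_high=0.28):
        # All tensors are (batch, seq_len); response_mask is 1.0 on generated
        # tokens and 0.0 on prompt/padding. Names and shapes are illustrative.
        ratio = torch.exp(logp_new - logp_old)
        # Decoupled clipping: a wider upper bound (eps_high > eps_low) keeps
        # low-probability tokens trainable upward, preserving exploration.
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
        per_token = -torch.min(ratio * advantages, clipped * advantages)
        # Token-level aggregation: every response token in the batch counts
        # equally, so long chains of thought are not down-weighted per sample.
        return (per_token * response_mask).sum() / response_mask.sum().clamp(min=1.0)

Dynamic sampling complements this loss: prompt groups whose rollouts are all correct or all incorrect yield zero advantages everywhere, so they are filtered out and resampled before the loss is computed.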
Quick Start & Requirements
Set up the environment with conda create -n dapo python=3.10 and conda activate dapo, followed by pip3 install vllm==0.8.2. The released scripts configure rollout for a multi-GPU node (tensor_parallel_size=8 with gpu_memory_utilization=0.95).
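For orientation, those two rollout settings map onto vLLM's engine arguments as sketched below. This is not the repo's actual launch path: the model name is a placeholder, and the real training scripts drive vLLM through verl rather than instantiating it directly.

    from vllm import LLM, SamplingParams

    # Shard the model across 8 GPUs and let vLLM claim 95% of each GPU's
    # memory for weights plus KV cache, mirroring the settings above.
    llm = LLM(
        model="Qwen/Qwen2.5-32B",      # placeholder; substitute your base model
        tensor_parallel_size=8,
        gpu_memory_utilization=0.95,
    )
    params = SamplingParams(temperature=1.0, max_tokens=4096)
    outputs = llm.generate(["Solve: what is 17 * 24?"], params)
    print(outputs[0].outputs[0].text)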
Highlighted Details
The headline result is 50 points on AIME 2024 from a Qwen2.5-32B base model, achieved with the full DAPO algorithm (see Limitations & Caveats below). The release spans the algorithm, verl-based training code, and the curated DAPO-Math-17K dataset.
Maintenance & Community
The project is a collaboration between ByteDance Seed and Tsinghua AIR. Discussions are welcomed via GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The full DAPO algorithm's reported performance (50 on AIME 2024) was achieved on an internal codebase with heavy engineering optimizations built on top of verl, and it has not yet been verified on the open-sourced verl framework. The README implies that the released training scripts may not fully replicate the top-tier results without those internal optimizations.