DAPO by BytedTsinghua-SIA

Open-source RL system for large-scale LLM training

created 4 months ago
1,473 stars

Top 28.4% on sourcepulse

Project Summary

DAPO is an open-source reinforcement learning system designed for large-scale LLM training, developed by ByteDance Seed and Tsinghua AIR. It provides a complete solution including algorithms, code infrastructure, and datasets, aiming to democratize access to advanced RL techniques for the research community.

How It Works

DAPO introduces a novel algorithm, Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), which improves the stability and performance of RL for LLMs. As the name suggests, it decouples the lower and upper PPO clipping ranges, raising the upper bound so that low-probability tokens can still gain probability mass (counteracting entropy collapse), and it dynamically filters out prompts whose sampled rollouts are all correct or all wrong, since those yield no gradient signal. During training, the system monitors response-length stability, reward-score stability, and the trends of entropy and mean token probability, balancing exploration and exploitation to prevent overfitting and sustain consistent performance gains.
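The decoupled-clip half of the objective can be sketched in a few lines. This is a minimal NumPy illustration, not code from the repository; the function name and default epsilon values (0.2 lower, 0.28 upper) are illustrative assumptions:

```python
import numpy as np

def dapo_clip_objective(logp_new, logp_old, advantages,
                        eps_low=0.2, eps_high=0.28):
    """Token-level decoupled-clip surrogate (sketch).

    Unlike vanilla PPO's symmetric clip(r, 1 - eps, 1 + eps), the
    two bounds are decoupled: raising the upper bound (eps_high >
    eps_low) lets low-probability tokens gain probability mass,
    which helps keep policy entropy from collapsing.
    """
    ratio = np.exp(logp_new - logp_old)           # importance ratio per token
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return surrogate.mean()                        # maximize (negate for a loss)
```

With symmetric PPO clipping at 0.2, a token whose ratio grew to 1.25 under a positive advantage would be clipped to 1.2; here the looser upper bound of 1.28 leaves it unclipped, which is the intended asymmetry.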

Quick Start & Requirements

  • Install: Use conda create -n dapo python=3.10 and conda activate dapo, followed by pip3 install vllm==0.8.2.
  • Prerequisites: Python 3.10, vLLM 0.8.2. Inference requires a powerful GPU setup (e.g., 8x GPUs for tensor_parallel_size=8 with gpu_memory_utilization=0.95).
  • Resources: Training requires significant computational resources. Inference with the provided Qwen-32B model demands substantial GPU memory.
  • Links: Paper, Blog, Datasets, Weights.

Highlighted Details

  • Achieves 50 points on AIME 2024 using Qwen2.5-32B, outperforming prior SoTA with fewer training steps.
  • Open-sources the DAPO algorithm, training infrastructure, and a 17k-sample math dataset (DAPO-Math-17k).
  • Provides inference code leveraging vLLM for efficient deployment.
  • Training scripts for DAPO variants are available, with one version verified to reach 44 AIME 2024 points on the open-sourced verl framework.
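The dynamic-sampling component of the open-sourced algorithm amounts to a simple batch filter. A minimal sketch in plain Python, with function and variable names that are illustrative rather than taken from the repo:

```python
def dynamic_sampling_filter(groups):
    """Keep prompts whose rollouts are neither all correct nor all
    wrong (sketch of DAPO's dynamic sampling).

    `groups` maps a prompt id to the list of 0/1 reward scores of
    its sampled rollouts. A prompt with zero reward variance gives
    zero group-relative advantage and thus no gradient signal, so
    it is dropped; in training, sampling continues until the batch
    is refilled with informative prompts.
    """
    return {
        prompt: rewards
        for prompt, rewards in groups.items()
        if 0 < sum(rewards) < len(rewards)   # mixed outcomes only
    }
```

For example, a batch where one prompt is always solved, one never solved, and one sometimes solved would keep only the last prompt.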

Maintenance & Community

The project is a collaboration between ByteDance Seed and Tsinghua AIR. Discussion is welcome via GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The full DAPO algorithm's headline result (AIME 50) was achieved on an internal codebase built on verl with heavy engineering optimizations, and has not yet been reproduced on the open-sourced verl framework. The README implies that the open-sourced training scripts may not fully replicate the top-tier results without these internal optimizations.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 292 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

open-r1 by huggingface: SDK for reproducing DeepSeek-R1

  • Top 0.2% on sourcepulse, 25k stars
  • created 6 months ago, updated 3 days ago