Open-source RL system for large-scale LLM training
DAPO is an open-source reinforcement learning system designed for large-scale LLM training, developed by ByteDance Seed and Tsinghua AIR. It provides a complete solution including algorithms, code infrastructure, and datasets, aiming to democratize access to advanced RL techniques for the research community.
How It Works
DAPO centers on a novel algorithm, Decoupled Clip and Dynamic Sampling Policy Optimization, which improves the stability and performance of RL for LLMs by decoupling the lower and upper PPO clip ranges and by dynamically resampling prompts whose rollouts carry no gradient signal (all answers correct or all incorrect). Training is supervised through key metrics: generation length and reward score should remain stable, while entropy and mean token probability should follow a controlled trend. Monitoring these signals balances exploration and exploitation, preventing overfitting and promoting consistent performance gains.
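As a rough illustration of the decoupled-clip idea, below is a minimal token-level loss with separate lower and upper clip ranges. This is a sketch, not the repo's actual API: the tensor names and shapes are invented, and the eps_low=0.2 / eps_high=0.28 defaults follow the values reported in the DAPO paper.

    import torch

    def dapo_policy_loss(logp_new, logp_old, advantages, response_mask,
                         eps_low=0.2, eps_high=0.28):
        # All tensors are (batch, seq_len); response_mask is 1.0 on generated
        # tokens and 0.0 on prompt/padding. Names and shapes are illustrative.
        ratio = torch.exp(logp_new - logp_old)
        # Decoupled clipping: a wider upper bound (eps_high > eps_low) keeps
        # low-probability tokens trainable upward, preserving exploration.
        clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
        per_token = -torch.min(ratio * advantages, clipped * advantages)
        # Token-level aggregation: every response token in the batch counts
        # equally, so long chains of thought are not down-weighted per sample.
        return (per_token * response_mask).sum() / response_mask.sum().clamp(min=1.0)

Dynamic sampling complements this loss: prompt groups whose rollouts are all correct or all incorrect yield zero advantages everywhere, so they are filtered out and resampled before the loss is computed.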
Quick Start & Requirements
Set up the environment with conda create -n dapo python=3.10 and conda activate dapo, followed by pip3 install vllm==0.8.2. The released scripts configure rollout for a multi-GPU node (tensor_parallel_size=8 with gpu_memory_utilization=0.95).
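For orientation, those two rollout settings map onto vLLM's engine arguments as sketched below. This is not the repo's actual launch path: the model name is a placeholder, and the real training scripts drive vLLM through verl rather than instantiating it directly.

    from vllm import LLM, SamplingParams

    # Shard the model across 8 GPUs and let vLLM claim 95% of each GPU's
    # memory for weights plus KV cache, mirroring the settings above.
    llm = LLM(
        model="Qwen/Qwen2.5-32B",      # placeholder; substitute your base model
        tensor_parallel_size=8,
        gpu_memory_utilization=0.95,
    )
    params = SamplingParams(temperature=1.0, max_tokens=4096)
    outputs = llm.generate(["Solve: what is 17 * 24?"], params)
    print(outputs[0].outputs[0].text)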
Highlighted Details
The headline result is 50 points on AIME 2024 from a Qwen2.5-32B base model, achieved with the full DAPO algorithm (see Limitations & Caveats below). The release spans the algorithm, verl-based training code, and the curated DAPO-Math-17K dataset.
Maintenance & Community
The project is a collaboration between ByteDance Seed and Tsinghua AIR. Discussions are welcomed via GitHub issues.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The full DAPO algorithm's reported performance (50 on AIME 2024) was achieved on an internal codebase with heavy engineering optimizations built on top of verl, and it has not yet been verified on the open-sourced verl framework. The README implies that the released training scripts may not fully replicate the top-tier results without those internal optimizations.