DAPO by BytedTsinghua-SIA

Open-source RL system for large-scale LLM training

Created 6 months ago
1,546 stars

Top 27.0% on SourcePulse

View on GitHub
Project Summary

DAPO is an open-source reinforcement learning system designed for large-scale LLM training, developed by ByteDance Seed and Tsinghua AIR. It provides a complete solution including algorithms, code infrastructure, and datasets, aiming to democratize access to advanced RL techniques for the research community.

How It Works

DAPO introduces a novel algorithm, Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO), which improves the stability and performance of RL training for LLMs. As the name suggests, the clip bounds of the policy objective are decoupled so the upper bound can be raised independently to encourage exploration, while dynamic sampling filters out prompts whose sampled responses all receive identical rewards and thus carry no gradient signal. During training, the system monitors response-length stability, reward-score stability, and the trends of policy entropy and mean token probability, balancing exploration and exploitation to prevent collapse and sustain consistent gains.
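As an illustration only (not the repository's code), the two namesake ideas can be sketched as follows; the clip bounds `eps_low`/`eps_high` and the reward values are placeholder assumptions:

```python
import numpy as np

def decoupled_clip_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with decoupled clip bounds: the upper bound
    (1 + eps_high) is raised independently of the lower bound, leaving
    more room to boost low-probability tokens (exploration)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return -np.minimum(unclipped, clipped).mean()

def dynamic_sampling_filter(reward_groups):
    """Keep only prompt groups whose sampled responses do not all share
    the same reward; all-identical groups yield zero advantage and hence
    contribute no gradient signal."""
    return [g for g in reward_groups if np.std(g) > 0]
```

For example, with a ratio of 1.5 and a positive advantage, the surrogate is clipped at 1 + eps_high = 1.28 rather than the symmetric 1.2, so the policy can still increase the probability of a favored token slightly further before the gradient is cut off.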

Quick Start & Requirements

  • Install: Use conda create -n dapo python=3.10 and conda activate dapo, followed by pip3 install vllm==0.8.2.
  • Prerequisites: Python 3.10, vLLM 0.8.2. Inference requires a powerful GPU setup (e.g., 8x GPUs for tensor_parallel_size=8 with gpu_memory_utilization=0.95).
  • Resources: Training requires significant computational resources. Inference with the provided Qwen-32B model demands substantial GPU memory.
  • Links: Paper, Blog, Datasets, Weights.
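Based on the inference settings above (tensor_parallel_size=8, gpu_memory_utilization=0.95), a minimal vLLM launch might look like the following sketch; the model path is a placeholder and the sampling parameter is illustrative, not taken from the repository:

```python
def engine_kwargs(model_path):
    # Engine settings from the quick-start notes: shard the 32B model
    # across 8 GPUs and use 95% of each GPU's memory.
    return dict(model=model_path,
                tensor_parallel_size=8,
                gpu_memory_utilization=0.95)

def run_inference(prompts, model_path="path/to/DAPO-Qwen-32B"):
    # Heavy import kept local: requires vllm==0.8.2 and an 8-GPU node.
    from vllm import LLM, SamplingParams
    llm = LLM(**engine_kwargs(model_path))
    params = SamplingParams(max_tokens=2048)  # illustrative value
    return llm.generate(prompts, params)
```

Running this requires the hardware described above; on a smaller setup, tensor_parallel_size and gpu_memory_utilization would need to be reduced accordingly.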

Highlighted Details

  • Achieves 50 points on AIME 2024 using Qwen2.5-32B, outperforming prior SoTA with fewer training steps.
  • Open-sources the DAPO algorithm, training infrastructure, and a 17k-sample math dataset (DAPO-Math-17k).
  • Provides inference code leveraging vLLM for efficient deployment.
  • Training scripts for DAPO variants are available, with one version verified to achieve 44 AIME points on vLLM.

Maintenance & Community

The project is a collaboration between ByteDance Seed and Tsinghua AIR. Discussions are welcomed via GitHub issues.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The full DAPO algorithm's headline result (AIME 50) was achieved on an internal codebase built on verl with heavy engineering optimizations, and has not yet been verified on the open-sourced verl framework. The README implies that the open-sourced training scripts might not fully replicate the top-tier results without these internal optimizations.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 38 stars in the last 30 days

Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

Explore Similar Projects

lingua by facebookresearch

0.1%
5k
LLM research codebase for training and inference
Created 11 months ago
Updated 2 months ago
Starred by Hanlin Tang (CTO Neural Networks at Databricks; Cofounder of MosaicML), Amanpreet Singh (Cofounder of Contextual AI), and 2 more.

coach by IntelLabs

0%
2k
Reinforcement learning framework for experimentation (discontinued)
Created 8 years ago
Updated 2 years ago