understand-r1-zero by sail-sg

Research paper analyzing R1-Zero-like training for LLMs

Created 9 months ago

1,185 stars

Top 32.7% on SourcePulse

4 Experts Love This Project

Jeff Hammerbacher

Cofounder of Cloudera

Lewis Tunstall

Research Engineer at Hugging Face

Vincent Weisser

Cofounder of Prime Intellect

Pawel Garbacki

Cofounder of Fireworks AI

Project Summary

This repository provides a critical analysis of R1-Zero-like training methodologies for Large Language Models (LLMs), focusing on base model selection and reinforcement learning (RL) techniques. It offers insights for researchers and practitioners aiming to optimize LLM alignment and reasoning capabilities, particularly in mathematical domains.

How It Works

The project critically examines the two core components of R1-Zero-like training: base models and RL algorithms. On the base-model side, it reports that models such as DeepSeek-V3-Base and Qwen2.5-Math already show substantial reasoning ability before RL, in the Qwen2.5 case even without a prompt template. On the RL side, it identifies an optimization bias in the GRPO algorithm and proposes "Dr. GRPO" as a fix that improves token efficiency while maintaining reasoning performance. The findings also suggest that prompt templates and question sets interact to shape RL dynamics: a template that mismatches the base model can degrade its reasoning ability before RL reconstructs it.
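As a rough illustration of one part of the GRPO bias discussed above, the sketch below compares a GRPO-style advantage (group mean subtracted, then divided by the group standard deviation) with a Dr. GRPO-style advantage (group mean subtracted only). The function names, the `eps` constant, the use of the population standard deviation, and the toy 0/1 rewards are assumptions made for this example; they are not taken from the repository's code.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style: center by the group mean, then divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style: center by the group mean only, no std division."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Hypothetical rewards for eight sampled answers per question (1.0 = correct).
hard = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # one correct answer
medium = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # half correct

if __name__ == "__main__":
    for name, rewards in (("hard", hard), ("medium", medium)):
        print(name, "GRPO:", round(grpo_advantages(rewards)[0], 3),
              "Dr. GRPO:", round(dr_grpo_advantages(rewards)[0], 3))
```

In this toy setup, dividing by the standard deviation inflates the advantage on the question whose rewards vary least, so questions that are very hard or very easy receive a larger weight in the update. Without the division, the advantage stays on the scale of the reward itself.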

Quick Start & Requirements

Installation: After cloning the repository, run pip install vllm==0.7.2 oat-llm==0.0.9, then pip install -e . from the repository root.
Prerequisites: Python 3.10 environment, vllm, oat-llm.
Training: Requires 8 x A100-40G GPUs for the example train_zero_math.py script.
Serving DeepSeek Models: Requires Kubernetes and either 2 x 8 x H100/H800/H20 GPUs (FP8 inference) or 4 x 8 x A100/A800 GPUs (BF16 inference).
Links: Paper, Models, OAT Framework.
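Because the installation step pins exact versions, a quick check of the active environment can catch mismatches before a long training run. The following is a minimal sketch that uses only the standard library; the `PINS` dictionary and the `check_pins` helper are hypothetical names introduced for this example and are not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pins copied from the installation step above.
PINS = {"vllm": "0.7.2", "oat-llm": "0.0.9"}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    issues = check_pins(PINS)
    for name, (expected, found) in issues.items():
        print(f"{name}: expected {expected}, found {found}")
    if not issues:
        print("All pinned packages match.")
```

The `get_version` parameter exists only so the check can be exercised without installing the packages.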

Highlighted Details

Qwen2.5 base models reason well even without a prompt template: average benchmark scores improve by ~60% compared with traditional 4-shot prompting.
GRPO algorithm can lead to biased optimization; Dr. GRPO is proposed as a fix.
Minimalist R1-Zero recipe (RL-tuning Qwen2.5-Math-7B with Dr. GRPO) achieves state-of-the-art performance with 27 hours of compute on 8 x A100 GPUs.
Includes scripts for training, evaluation, and serving models via SGLang.
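The token-efficiency gain attributed to Dr. GRPO relates to how per-token losses are aggregated. The sketch below shows the weight each token receives under two schemes: a GRPO-style scheme that averages over each response's own length, and a Dr. GRPO-style scheme that divides by a fixed constant instead. The function names and the choice of `max_tokens` as the constant are assumptions made for illustration, not the repository's implementation.

```python
def grpo_token_weights(lengths):
    """Per-token weight when each response's loss is averaged over its own length."""
    n = len(lengths)
    return [1.0 / (n * length) for length in lengths]


def dr_grpo_token_weights(lengths, max_tokens):
    """Per-token weight when each response's summed loss is divided by a constant."""
    n = len(lengths)
    return [1.0 / (n * max_tokens) for _ in lengths]


if __name__ == "__main__":
    lengths = [2, 8]  # hypothetical: one short and one long response to a question
    print("GRPO-style:    ", grpo_token_weights(lengths))
    print("Dr. GRPO-style:", dr_grpo_token_weights(lengths, max_tokens=8))
```

With per-response averaging, each token in the long response carries a smaller weight than each token in the short one, so a long incorrect response is penalized less per token than a short incorrect one. Dividing by a constant gives every token the same weight regardless of response length.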

Maintenance & Community

The project is led by Zichen Liu and includes core contributors from SAIL. It is associated with the OAT LLM framework. A Discord server is available for community discussion.

Licensing & Compatibility

The repository's licensing is not explicitly stated in the README. However, it depends on frameworks like vLLM and OAT, and uses base models from Qwen, Llama, and DeepSeek, which have their own licenses. Compatibility for commercial use would require verifying the licenses of all constituent components.

Limitations & Caveats

The research focuses on specific mathematical reasoning tasks and models; generalizability to other domains may vary. The proposed Dr. GRPO is a simple fix for GRPO bias, and further research may be needed for broader applicability. Serving DeepSeek models requires significant hardware resources and Kubernetes infrastructure.
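A back-of-the-envelope calculation gives a feel for the serving requirement. The sketch below assumes DeepSeek-V3's publicly reported total of about 671 billion parameters and 80 GB of memory per GPU; both figures are assumptions of this example rather than numbers from the summary, and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Weights-only footprint in GB (1e9 parameters x bytes per parameter / 1e9)."""
    return params_billion * bytes_per_param


def cluster_memory_gb(nodes, gpus_per_node, gb_per_gpu):
    """Total GPU memory across the cluster in GB."""
    return nodes * gpus_per_node * gb_per_gpu


PARAMS_B = 671    # assumed total parameter count, in billions
GB_PER_GPU = 80   # assumed per-GPU memory; actual cards vary

if __name__ == "__main__":
    configs = (("FP8", 1, 2), ("BF16", 2, 4))  # (label, bytes per param, nodes)
    for label, bytes_per_param, nodes in configs:
        need = weight_memory_gb(PARAMS_B, bytes_per_param)
        have = cluster_memory_gb(nodes, 8, GB_PER_GPU)
        print(f"{label}: weights ~{need} GB, cluster {have} GB over {nodes} x 8 GPUs")
```

Under these assumptions, the FP8 weights alone exceed the 640 GB available on a single 8-GPU node, which is consistent with the multi-node configurations listed for serving.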

Health Check

Last Commit

4 months ago

Responsiveness

1 day

Pull Requests (30d)

Issues (30d)

Star History

19 stars in the last 30 days