understand-r1-zero by sail-sg

Research paper analyzing R1-Zero-like training for LLMs

Created 9 months ago
1,185 stars

Top 32.7% on SourcePulse

View on GitHub
Project Summary

This repository provides a critical analysis of R1-Zero-like training methodologies for Large Language Models (LLMs), focusing on base model selection and reinforcement learning (RL) techniques. It offers insights for researchers and practitioners aiming to optimize LLM alignment and reasoning capabilities, particularly in mathematical domains.

How It Works

The project critically examines the two key components of R1-Zero-like training: the base model and the RL algorithm. It demonstrates that certain base models, such as DeepSeek-V3-Base and Qwen2.5-Math, exhibit significant reasoning ability even without any prompt template. The research also identifies optimization biases in the GRPO algorithm and proposes "Dr. GRPO" as a fix that improves token efficiency while maintaining reasoning performance. Finally, it finds that prompt templates and question sets interact to shape RL dynamics: a mismatched template can initially destroy a base model's reasoning ability, which RL must then rebuild.
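
One of the biases comes from GRPO's advantage normalization. A minimal sketch of the advantage computation under both algorithms, assuming scalar 0/1 correctness rewards over a group of sampled responses (function names are illustrative, not the repository's API):

```python
import torch

def grpo_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # GRPO: center rewards within the group, then divide by the group's
    # reward std. The std term over-weights questions whose rewards barely
    # vary (the question-level difficulty bias the paper identifies).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantage(rewards: torch.Tensor) -> torch.Tensor:
    # Dr. GRPO: keep only the group-mean baseline; drop the std term.
    return rewards - rewards.mean()

# Four sampled responses to one question, rewarded 1 if correct else 0.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantage(rewards))     # std-scaled advantages
print(dr_grpo_advantage(rewards))  # plain centered advantages
```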

Quick Start & Requirements

  • Installation: run pip install vllm==0.7.2 oat-llm==0.0.9, then clone the repository and run pip install -e . from its root.
  • Prerequisites: Python 3.10 environment, vllm, oat-llm.
  • Training: Requires 8 x A100-40G GPUs for the example train_zero_math.py script.
  • Serving DeepSeek Models: Requires Kubernetes and either 2 x 8 x H100/H800/H20 GPUs (FP8) or 4 x 8 x A100/A800 GPUs (BF16).
  • Links: Paper, Models, OAT Framework.

Highlighted Details

  • Qwen2.5 base models show ~60% benchmark score improvement without prompt templates.
  • The GRPO algorithm can lead to biased optimization; Dr. GRPO is proposed as a fix (see the sketch after this list).
  • Minimalist R1-Zero recipe achieves state-of-the-art performance in 27 hours on 8 x A100 GPUs.
  • Includes scripts for training, evaluation, and serving models via SGLang.
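
The other half of the Dr. GRPO fix concerns loss aggregation: GRPO averages each response's token loss over that response's own length, which shrinks the per-token penalty on long incorrect answers and encourages ever-longer outputs. A sketch of the two aggregation schemes, following the paper's description; the tensor shapes and the constant normalizer here are assumptions, not the repository's exact code:

```python
import torch

def grpo_aggregate(token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # GRPO: mean over each response's own tokens (divide by |o_i|), then
    # mean over the group. The per-length division is the length bias:
    # long wrong answers receive a smaller per-token penalty.
    per_response = (token_loss * mask).sum(-1) / mask.sum(-1)
    return per_response.mean()

def dr_grpo_aggregate(token_loss: torch.Tensor, mask: torch.Tensor,
                      max_tokens: int) -> torch.Tensor:
    # Dr. GRPO: sum token losses and divide by a constant (e.g. the
    # generation budget), so the normalizer no longer depends on length.
    return (token_loss * mask).sum(-1).mean() / max_tokens

# token_loss: (group_size, seq_len) per-token policy-gradient losses;
# mask flags real (non-padding) tokens.
token_loss, mask = torch.randn(4, 8), torch.ones(4, 8)
print(grpo_aggregate(token_loss, mask))
print(dr_grpo_aggregate(token_loss, mask, max_tokens=8))
```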

Maintenance & Community

The project is led by Zichen Liu and includes core contributors from SAIL. It is built on the OAT LLM framework, and a Discord server is available for community discussion.

Licensing & Compatibility

The repository's licensing is not explicitly stated in the README. However, it depends on frameworks like vLLM and OAT, and uses base models from Qwen, Llama, and DeepSeek, which have their own licenses. Compatibility for commercial use would require verifying the licenses of all constituent components.

Limitations & Caveats

The research focuses on specific mathematical reasoning tasks and models; generalizability to other domains may vary. The proposed Dr. GRPO is a simple fix for GRPO bias, and further research may be needed for broader applicability. Serving DeepSeek models requires significant hardware resources and Kubernetes infrastructure.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Wing Lian (Founder of Axolotl AI), and 3 more.

ROLL by alibaba

Top 2.3% · 3k stars
RL library for large language models
Created 7 months ago · Updated 21 hours ago
Starred by Vincent Weisser (Cofounder of Prime Intellect), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 4 more.

simpleRL-reason by hkust-nlp

Top 0.1% · 4k stars
RL recipe for reasoning ability in models
Created 11 months ago · Updated 2 weeks ago
Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), and 11 more.

TinyZero by Jiayi-Pan

Top 0.2% · 13k stars
Minimal reproduction of DeepSeek R1 Zero for countdown/multiplication tasks
Created 11 months ago · Updated 8 months ago