understand-r1-zero by sail-sg

Research paper analyzing R1-Zero-like training for LLMs

created 4 months ago
1,049 stars

Top 36.5% on sourcepulse

Project Summary

This repository provides a critical analysis of R1-Zero-like training methodologies for Large Language Models (LLMs), focusing on base model selection and reinforcement learning (RL) techniques. It offers insights for researchers and practitioners aiming to optimize LLM alignment and reasoning capabilities, particularly in mathematical domains.

How It Works

The project critically examines the two core components of R1-Zero-like training: base models and RL algorithms. It shows that certain base models, such as DeepSeek-V3-Base and Qwen2.5-Math, already exhibit strong reasoning behavior even without any prompt template. The research also identifies an optimization bias in the GRPO algorithm and proposes "Dr. GRPO" as a fix that improves token efficiency while maintaining reasoning performance. The findings further suggest that prompt templates and question sets interact to shape RL dynamics: a mismatched template first degrades a model's reasoning ability, which RL must then reconstruct.
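
Concretely, the bias the paper traces comes from two normalization terms in GRPO's objective: each response's token losses are averaged over its own length |o_i| (a length bias), and group-relative advantages are divided by the group's reward standard deviation (a question-difficulty bias). The following is a minimal sketch of the difference, with illustrative function names; the repository's actual implementation lives in the OAT framework:

```python
import numpy as np

def grpo_token_weight(rewards, lengths):
    # GRPO: std-normalize the group-relative advantage, then average
    # each response's token losses over its own length |o_i|.
    # The paper identifies both terms as sources of optimization bias.
    r = np.asarray(rewards, dtype=np.float64)
    adv = (r - r.mean()) / (r.std() + 1e-8)
    return adv / np.asarray(lengths)

def dr_grpo_token_weight(rewards, lengths):
    # Dr. GRPO: mean-center only. No std division and no 1/|o_i| term
    # ('lengths' is intentionally unused), so every sampled token
    # carries the same weight regardless of response length or
    # question difficulty.
    r = np.asarray(rewards, dtype=np.float64)
    return r - r.mean()
```

Each returned value is the scalar weight applied to that response's per-token policy-gradient terms; per the paper, dropping the two normalizations removes the length and difficulty biases without hurting performance.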

Quick Start & Requirements

  • Installation: pip install vllm==0.7.2 oat-llm==0.0.9, then pip install -e . after cloning the repository (consolidated in the snippet after this list).
  • Prerequisites: Python 3.10 environment, vllm, oat-llm.
  • Training: Requires 8 x A100-40G GPUs for the example train_zero_math.py script.
  • Serving DeepSeek Models: Requires Kubernetes and either 2 x 8 H100/H800/H20 GPUs (FP8) or 4 x 8 A100/A800 GPUs (BF16).
  • Links: Paper, Models, OAT Framework.
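
Putting the bullets above together, setup might look like this; the clone URL is inferred from the project name, and the training invocation is a sketch, so check the repository's README for the exact flags:

```bash
# Clone and install pinned dependencies (Python 3.10 environment assumed).
git clone https://github.com/sail-sg/understand-r1-zero.git
cd understand-r1-zero
pip install vllm==0.7.2 oat-llm==0.0.9
pip install -e .

# Example training run; expects 8 x A100-40G GPUs.
python train_zero_math.py
```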

Highlighted Details

  • Qwen2.5 base models show a ~60% benchmark score improvement when prompted without any template (see the sketch after this list).
  • The GRPO algorithm can bias optimization, for example toward progressively longer incorrect responses; Dr. GRPO is proposed as a fix.
  • Minimalist R1-Zero recipe achieves state-of-the-art performance in 27 hours on 8 x A100 GPUs.
  • Includes scripts for training, evaluation, and serving models via SGLang.
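
As the first bullet notes, "without any template" means feeding the raw question straight to the base model, with no chat or R1-style wrapper around it. A minimal sketch with vLLM follows; the model ID and question are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# Load a base (non-instruct) checkpoint; Qwen/Qwen2.5-Math-7B is one of
# the base models studied in the paper.
llm = LLM(model="Qwen/Qwen2.5-Math-7B")

# No chat template, no R1-style wrapper: just the raw question.
question = "What is the sum of the first 100 positive integers?"
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate([question], params)
print(outputs[0].outputs[0].text)
```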

Maintenance & Community

The project is led by Zichen Liu with core contributors from Sea AI Lab (SAIL). It builds on the OAT LLM framework, and a Discord server is available for community discussion.

Licensing & Compatibility

The repository's licensing is not explicitly stated in the README. However, it depends on frameworks like vLLM and OAT, and uses base models from Qwen, Llama, and DeepSeek, which have their own licenses. Compatibility for commercial use would require verifying the licenses of all constituent components.

Limitations & Caveats

The research focuses on specific mathematical reasoning tasks and models, so generalizability to other domains may vary. Dr. GRPO is a minimal fix for GRPO's optimization bias, and broader applicability may require further study. Serving the DeepSeek models requires substantial hardware and Kubernetes infrastructure.

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 148 stars in the last 90 days

Explore Similar Projects

TinyZero by Jiayi-Pan

  • Minimal reproduction of DeepSeek R1 Zero for countdown/multiplication tasks
  • 12k stars, top 0.2% on sourcepulse
  • Created 6 months ago, updated 3 months ago
  • Starred by George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), and 5 more.

open-r1 by huggingface

  • SDK for reproducing DeepSeek-R1
  • 25k stars, top 0.2% on sourcepulse
  • Created 6 months ago, updated 3 days ago
  • Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.