Replicate DeepSeek R1 LLM training from scratch
This repository provides a step-by-step guide and code to replicate the DeepSeek R1 reasoning model training process. It targets engineers and researchers interested in understanding and implementing advanced reinforcement learning techniques for LLMs, specifically focusing on improving reasoning capabilities. The project aims to demystify the complex training pipeline of DeepSeek R1 by offering a practical, code-driven explanation with simplified components.
How It Works
The project breaks down the DeepSeek R1 training into manageable stages, starting with a GRPO (Group Relative Policy Optimization) based approach for an initial "R1 Zero" model. This stage uses a smaller base model (Qwen2.5-0.5B-Instruct) and applies multiple reward functions (accuracy, format, reasoning steps, cosine scaling, repetition penalty) to guide learning. It then details Supervised Fine-Tuning (SFT) on curated datasets such as Bespoke-Stratos-17k to improve reasoning clarity and language consistency, addressing issues observed in R1 Zero. The theoretical aspects of the subsequent RL stages and distillation are also covered.
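As a rough illustration of how one of these rewards can plug into a GRPO loop, here is a minimal sketch using trl's GRPOTrainer with a format reward that checks for `<think>`/`<answer>` tags. The tag pattern, the placeholder prompt dataset, and the hyperparameters are illustrative assumptions, not the repository's actual implementation.

```python
# Minimal sketch (not the repository's exact code): a format-following reward
# wired into trl's GRPOTrainer for an "R1 Zero"-style run.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    """Return 1.0 for completions that follow <think>...</think><answer>...</answer>, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

# Placeholder prompt dataset with a "prompt" column; the real project would use
# its own reasoning prompts instead.
train_dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[format_reward],  # accuracy, reasoning-step, cosine, and repetition rewards would be added here
    args=GRPOConfig(
        output_dir="r1-zero-sketch",
        num_generations=4,               # completions sampled per prompt for the group-relative advantage
        per_device_train_batch_size=4,   # must be divisible by num_generations
    ),
    train_dataset=train_dataset,
)
trainer.train()
```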
Quick Start & Requirements
pip install -r requirements.txt
Highlighted Details
Uses the datasets and trl libraries for efficient data handling and training.
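As a sketch of how that pairing looks for the SFT stage, the snippet below loads Bespoke-Stratos-17k with datasets and fine-tunes the base model with trl's SFTTrainer. The dataset ID, the ShareGPT-style column names, and the hyperparameters are assumptions about the data schema rather than the repository's exact configuration.

```python
# Minimal SFT sketch (assumptions, not the repository's exact code):
# fine-tune the small base model on curated reasoning traces.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed Hugging Face dataset ID for Bespoke-Stratos-17k.
dataset = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

def to_messages(example):
    """Convert ShareGPT-style turns to the chat "messages" format SFTTrainer expects.
    The "conversations"/"from"/"value" column names are assumptions about the dataset schema."""
    role_map = {"system": "system", "human": "user", "gpt": "assistant"}
    messages = [
        {"role": role_map.get(turn["from"], "user"), "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return {"messages": messages}

dataset = dataset.map(to_messages, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="r1-sft-sketch",
        per_device_train_batch_size=2,
        num_train_epochs=1,
    ),
)
trainer.train()
```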
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats