Reasoning models research paper
Top 0.1% on sourcepulse
DeepSeek-R1 is a family of large language models focused on enhancing reasoning capabilities, particularly through reinforcement learning (RL). It offers both large Mixture-of-Experts (MoE) models (DeepSeek-R1-Zero and DeepSeek-R1) and smaller, distilled dense models based on Llama and Qwen architectures, targeting researchers and developers seeking advanced reasoning performance.
How It Works
The core innovation lies in applying RL directly to base models without initial supervised fine-tuning (SFT), enabling emergent reasoning behaviors like self-verification and long chain-of-thought generation. DeepSeek-R1 further refines this with a multi-stage RL and SFT pipeline. Distillation techniques are then used to transfer these reasoning patterns into smaller, more accessible dense models, achieving state-of-the-art results for their size.
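The distillation step amounts to plain supervised fine-tuning on reasoning traces sampled from the large teacher. A minimal sketch of the data-preparation side, where `teacher_generate` is a hypothetical stand-in for sampling DeepSeek-R1 (a real pipeline would call the model):

```python
import json

def teacher_generate(prompt: str) -> str:
    # Hypothetical stand-in: a real pipeline samples a long
    # chain-of-thought completion from the teacher (DeepSeek-R1).
    return "<think>\nstep-by-step reasoning...\n</think>\nfinal answer"

def build_distillation_set(prompts, path="distill.jsonl"):
    # Pair each prompt with a teacher reasoning trace; the smaller dense
    # student (Llama/Qwen) is then fine-tuned on these records with SFT
    # only, no RL stage.
    with open(path, "w") as f:
        for p in prompts:
            record = {"prompt": p, "completion": teacher_generate(p)}
            f.write(json.dumps(record) + "\n")
    return path

path = build_distillation_set(["What is 7 * 8?"])
```

The file layout (`prompt`/`completion` JSONL) is an illustrative choice, not the format used by the authors.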
Quick Start & Requirements
Serve the distilled models with vLLM:

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager

or with SGLang:

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2

Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The base R1 models are not yet directly supported by Hugging Face Transformers. Specific prompt formatting and configuration are recommended for optimal reasoning performance, including prepending "<think>\n" to the model's response so that it begins with an explicit reasoning step rather than skipping it.
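The quick-start commands above expose an OpenAI-compatible endpoint (port 8000 by default for vLLM). A minimal sketch of querying it while forcing the "<think>\n" prefix; the `continue_final_message`/`add_generation_prompt` fields are vLLM extensions to the OpenAI schema (an assumption to verify against your vLLM version), and the endpoint URL assumes a local default deployment:

```python
import json
import urllib.request

SERVER = "http://localhost:8000/v1/chat/completions"  # default vllm serve address

def build_payload(question: str) -> dict:
    # Put the full instruction in the user turn and seed the assistant
    # message with "<think>\n" so the model cannot skip its reasoning
    # phase. continue_final_message / add_generation_prompt are vLLM
    # extensions (assumption; check your server version).
    return {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": "<think>\n"},
        ],
        "add_generation_prompt": False,
        "continue_final_message": True,
        "temperature": 0.6,
    }

def ask(question: str) -> str:
    # Plain-stdlib POST; swap in the openai client if preferred.
    req = urllib.request.Request(
        SERVER,
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Calling `ask(...)` requires one of the servers from the quick-start section to be running; `build_payload` alone shows the recommended request shape.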