MoE language model for research/API use
DeepSeek-V2 is a powerful Mixture-of-Experts (MoE) language model designed for strong performance, economical training, and efficient inference. It targets researchers and developers who need high-quality language generation and coding capabilities; compared with DeepSeek 67B, it cuts training costs by 42.5%, shrinks the KV cache by 93.3%, and raises maximum generation throughput to 5.76x.
How It Works
DeepSeek-V2 employs two key architectural innovations: Multi-head Latent Attention (MLA), which compresses the keys and values into a low-rank latent vector for efficient inference, and the DeepSeekMoE architecture for the feed-forward networks (FFNs), which enables economical training through sparse computation. Together these yield a model with 236B total parameters, of which only 21B are activated per token, drastically reducing KV cache requirements and boosting generation throughput.
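The KV-cache saving can be illustrated with a minimal sketch of the compression idea: only a low-rank latent vector per token is cached, and keys and values are reconstructed from it at attention time. The module name, all dimensions, and the omission of causal masking, query compression, and the decoupled RoPE path are simplifications for illustration, not the actual DeepSeek-V2 configuration.

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    """Illustrative sketch: cache a low-rank latent instead of full keys/values."""

    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress hidden state to latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys from latent
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values from latent
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.out = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        c_kv = self.down_kv(x)                                   # (b, t, d_latent): the only tensor cached
        if latent_cache is not None:
            c_kv = torch.cat([latent_cache, c_kv], dim=1)        # append to previously cached latents
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.up_k(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.up_v(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)  # causal mask omitted
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y), c_kv                                 # return output and updated latent cache
```

With 32 heads of dimension 128, standard attention would cache 2 × 32 × 128 = 8192 values per token per layer, whereas this sketch caches only a 512-dimensional latent, which is where the cache reduction comes from.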
Quick Start & Requirements
Loading the base model with Hugging Face Transformers (BF16 inference requires 8x 80GB GPUs):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Cap per-GPU memory so device_map="sequential" spreads the weights across all 8 GPUs
max_memory = {i: "75GB" for i in range(8)}
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
```
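A minimal completion call on top of the loaded model could look like the sketch below; the prompt and max_new_tokens value are illustrative, and the repository additionally configures a GenerationConfig and pad token that are omitted here.

```python
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# Strip the prompt tokens and print only the newly generated continuation
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```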
Serving the chat model with SGLang (tensor parallelism across 8 GPUs):

```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Chat --tp 8 --trust-remote-code
```
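Once launched, the SGLang server exposes an OpenAI-compatible API (on port 30000 by default), so it can be queried with the standard openai client; the base URL, prompt, and max_tokens below are illustrative assumptions.

```python
from openai import OpenAI

# Point the OpenAI client at the local SGLang server (no real API key needed)
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Chat",
    messages=[{"role": "user", "content": "Write a piece of quicksort code in C++."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```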
Offline inference with vLLM:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V2-Chat", tensor_parallel_size=8, max_model_len=8192, trust_remote_code=True)
```
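Continuing from the block above, a chat-style generation pass might look like the following sketch. The prompt and sampling values are illustrative, and the prompt_token_ids keyword matches older vLLM releases; newer versions expect token prompts to be wrapped differently.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Chat", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# Render the chat template to token ids and generate offline with the llm built above
messages = [{"role": "user", "content": "Write a piece of quicksort code in C++."}]
prompt_token_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
outputs = llm.generate(prompt_token_ids=[prompt_token_ids], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```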
Highlighted Details
Maintenance & Community
Licensing & Compatibility
The code in this repository is released under the MIT License; use of the DeepSeek-V2 Base/Chat model weights is governed by the DeepSeek Model License, which permits commercial use.
Limitations & Caveats
Due to Hugging Face constraints, the open-source Transformers implementation currently runs slower than DeepSeek's internal codebase. For efficient inference, serving the model with SGLang or vLLM using the tensor-parallel configurations shown above is recommended.