MoE language model for research/API use
DeepSeek-V2 is a powerful Mixture-of-Experts (MoE) language model designed for strong performance, economical training, and efficient inference. It targets researchers and developers who need high-quality language generation and coding capabilities; compared with DeepSeek 67B, it cuts training costs by 42.5%, shrinks the KV cache by 93.3%, and raises maximum generation throughput to 5.76x.
How It Works
DeepSeek-V2 employs two key architectural innovations: Multi-head Latent Attention (MLA), which compresses the keys and values into a low-rank latent vector for efficient inference, and the DeepSeekMoE architecture for the feed-forward networks (FFNs), which enables economical training through sparse computation. Together these yield a model with 236B total parameters, of which only 21B are activated per token, drastically reducing KV cache requirements and boosting generation throughput.
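The KV-cache saving can be illustrated with a minimal sketch of the compression idea: only a low-rank latent vector per token is cached, and keys and values are reconstructed from it at attention time. The module name, all dimensions, and the omission of causal masking, query compression, and the decoupled RoPE path are simplifications for illustration, not the actual DeepSeek-V2 configuration.

```python
import torch
import torch.nn as nn

class SimplifiedMLA(nn.Module):
    """Illustrative sketch: cache a low-rank latent instead of full keys/values."""

    def __init__(self, d_model=4096, n_heads=32, d_head=128, d_latent=512):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress hidden state to latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys from latent
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values from latent
        self.q_proj = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.out = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        c_kv = self.down_kv(x)                                   # (b, t, d_latent): the only tensor cached
        if latent_cache is not None:
            c_kv = torch.cat([latent_cache, c_kv], dim=1)        # append to previously cached latents
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.up_k(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.up_v(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)  # causal mask omitted
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y), c_kv                                 # return output and updated latent cache
```

With 32 heads of dimension 128, standard attention would cache 2 × 32 × 128 = 8192 values per token per layer, whereas this sketch caches only a 512-dimensional latent, which is where the cache reduction comes from.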
Quick Start & Requirements
Loading the base model with Hugging Face Transformers (BF16 inference requires 8x 80GB GPUs):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "deepseek-ai/DeepSeek-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Cap per-GPU memory so device_map="sequential" spreads the weights across all 8 GPUs
max_memory = {i: "75GB" for i in range(8)}
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="sequential", torch_dtype=torch.bfloat16, max_memory=max_memory, attn_implementation="eager")
```
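A minimal completion call on top of the loaded model could look like the sketch below; the prompt and max_new_tokens value are illustrative, and the repository additionally configures a GenerationConfig and pad token that are omitted here.

```python
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# Strip the prompt tokens and print only the newly generated continuation
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```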
Serving the chat model with SGLang (tensor parallelism across 8 GPUs):

```bash
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V2-Chat --tp 8 --trust-remote-code
```
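Once launched, the SGLang server exposes an OpenAI-compatible API (on port 30000 by default), so it can be queried with the standard openai client; the base URL, prompt, and max_tokens below are illustrative assumptions.

```python
from openai import OpenAI

# Point the OpenAI client at the local SGLang server (no real API key needed)
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V2-Chat",
    messages=[{"role": "user", "content": "Write a piece of quicksort code in C++."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```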
Offline inference with vLLM:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V2-Chat", tensor_parallel_size=8, max_model_len=8192, trust_remote_code=True)
```
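Continuing from the block above, a chat-style generation pass might look like the following sketch. The prompt and sampling values are illustrative, and the prompt_token_ids keyword matches older vLLM releases; newer versions expect token prompts to be wrapped differently.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Chat", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

# Render the chat template to token ids and generate offline with the llm built above
messages = [{"role": "user", "content": "Write a piece of quicksort code in C++."}]
prompt_token_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
outputs = llm.generate(prompt_token_ids=[prompt_token_ids], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```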
Highlighted Details
Maintenance & Community
Licensing & Compatibility
The code in this repository is released under the MIT License; use of the DeepSeek-V2 Base/Chat model weights is governed by the DeepSeek Model License, which permits commercial use.
Limitations & Caveats
Due to Hugging Face constraints, the open-source Transformers implementation currently runs slower than DeepSeek's internal codebase. For efficient inference, serving the model with SGLang or vLLM using the tensor-parallel configurations shown above is recommended.