CPU inference for DeepSeek LLMs in C++
This C++ project provides CPU-only inference for the DeepSeek family of large language models, targeting users who need efficient, hackable, and self-contained LLM execution without GPU dependencies. It offers a lean alternative to larger inference engines, enabling focused study of DeepSeek model performance on CPU.
How It Works
The implementation is based on Yet Another Language Model (YALM) and is specifically tailored for DeepSeek architectures. It uses custom quantization methods such as f8e5m2 (128x128 blocks with full-precision MoE gates and layer norms) and q2_k (llama.cpp's 2-bit K-quantization) to optimize CPU performance and memory usage. The project prioritizes simplicity and hackability, with a significantly smaller codebase than other inference engines.
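For intuition only, the sketch below shows one way an f8e5m2 block quantizer could look in C++; it is not the project's actual kernel code. It assumes a compiler with _Float16 support (recent GCC/Clang) and relies on the fact that e5m2 is the upper byte of an IEEE fp16 value; the names f32_to_f8e5m2 and quantize_block_f8e5m2 are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: round an fp32 value to the nearest f8e5m2 code.
// e5m2 (1 sign, 5 exponent, 2 mantissa bits) is the high byte of fp16,
// so we convert to fp16 and round away the low 8 mantissa bits.
// NaN/Inf edge cases are ignored for brevity.
static uint8_t f32_to_f8e5m2(float x) {
    _Float16 h = static_cast<_Float16>(x);   // assumes _Float16 support
    uint16_t bits;
    std::memcpy(&bits, &h, sizeof(bits));
    uint16_t rounded = bits + 0x7F + ((bits >> 8) & 1);  // round-to-nearest-even
    return static_cast<uint8_t>(rounded >> 8);
}

// Quantize one 128x128 tile of a weight matrix. Per the README, MoE gates
// and layer norms stay in full precision and would not pass through here.
void quantize_block_f8e5m2(const float* src, uint8_t* dst,
                           int rows = 128, int cols = 128) {
    for (int i = 0; i < rows * cols; ++i)
        dst[i] = f32_to_f8e5m2(src[i]);
}
```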
Quick Start & Requirements
Build and install with pip install . (after cloning the repo and installing git-lfs and build tools; python3-dev and build-essential are required).
Convert model weights with python convert.py --quant <quant_type> <model_dir>
Run inference with ./build/main <model_weights_dir> -i "prompt"
Setting the OMP_NUM_THREADS environment variable is crucial for optimal throughput (see the sketch after these steps).
Run ./build/main -h for the full list of options.
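As a rough illustration of why OMP_NUM_THREADS matters, the sketch below (not code from this repo) shows the kind of OpenMP-parallelized matrix-vector loop that dominates CPU decoding; the runtime's thread count, reported by omp_get_max_threads(), is what that variable controls.

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

// Illustrative matvec kernel: the per-row loop is split across the threads
// the OpenMP runtime allocates, which OMP_NUM_THREADS controls.
void matvec(const std::vector<float>& w, const std::vector<float>& x,
            std::vector<float>& y, int rows, int cols) {
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c)
            acc += w[r * cols + c] * x[c];
        y[r] = acc;
    }
}

int main() {
    // Prints the thread count the runtime will use, e.g. 8 when the program
    // is launched with OMP_NUM_THREADS=8.
    std::printf("OpenMP max threads: %d\n", omp_get_max_threads());
    return 0;
}
```

Compile with -fopenmp (GCC/Clang) for the pragma to take effect.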
Highlighted Details
Custom quantization formats (f8e5m2, q2_k) chosen for accuracy and efficiency.
Maintenance & Community
This is a personal side project for learning and experimentation. Contributions (PRs) are welcome.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Only decoding (incremental generation) is implemented; prefill operations and optimizations like speculative decoding are missing. Some DeepSeek V3 architectural features are not yet implemented, potentially impacting accuracy. Models may exhibit repetitive behavior at low temperatures.
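To see why low temperatures encourage repetition, consider a generic temperature-scaled sampler (a sketch, not this project's sampler): dividing logits by a small temperature sharpens the softmax toward argmax, so the same high-probability continuation keeps getting picked.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Generic temperature sampling sketch. As temperature approaches 0 the
// distribution collapses onto the single highest logit, making generation
// near-greedy and prone to looping on the same tokens.
int sample_token(const std::vector<float>& logits, float temperature, std::mt19937& rng) {
    float t = std::max(temperature, 1e-6f);             // avoid divide-by-zero
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> weights(logits.size());
    for (size_t i = 0; i < logits.size(); ++i)
        weights[i] = std::exp((logits[i] - max_logit) / t);  // softmax numerator
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);                                    // normalization handled by dist
}
```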