CPU inference for DeepSeek LLMs in C++
This C++ project provides CPU-only inference for the DeepSeek family of large language models, targeting users who need efficient, hackable, and self-contained LLM execution without GPU dependencies. It offers a lean alternative to larger inference engines, enabling focused study of DeepSeek model performance on CPU.
How It Works
The implementation is based on Yet Another Language Model (YALM) and is specifically tailored for DeepSeek architectures. It uses custom quantization methods such as f8e5m2 (128x128 blocks with full-precision MoE gates and layer norms) and q2_k (llama.cpp's 2-bit K-quantization) to optimize CPU performance and memory usage. The project prioritizes simplicity and hackability, with a significantly smaller codebase than other inference engines.
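For intuition only, the sketch below shows one way an f8e5m2 block quantizer could look in C++; it is not the project's actual kernel code. It assumes a compiler with _Float16 support (recent GCC/Clang) and relies on the fact that e5m2 is the upper byte of an IEEE fp16 value; the names f32_to_f8e5m2 and quantize_block_f8e5m2 are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: round an fp32 value to the nearest f8e5m2 code.
// e5m2 (1 sign, 5 exponent, 2 mantissa bits) is the high byte of fp16,
// so we convert to fp16 and round away the low 8 mantissa bits.
// NaN/Inf edge cases are ignored for brevity.
static uint8_t f32_to_f8e5m2(float x) {
    _Float16 h = static_cast<_Float16>(x);   // assumes _Float16 support
    uint16_t bits;
    std::memcpy(&bits, &h, sizeof(bits));
    uint16_t rounded = bits + 0x7F + ((bits >> 8) & 1);  // round-to-nearest-even
    return static_cast<uint8_t>(rounded >> 8);
}

// Quantize one 128x128 tile of a weight matrix. Per the README, MoE gates
// and layer norms stay in full precision and would not pass through here.
void quantize_block_f8e5m2(const float* src, uint8_t* dst,
                           int rows = 128, int cols = 128) {
    for (int i = 0; i < rows * cols; ++i)
        dst[i] = f32_to_f8e5m2(src[i]);
}
```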
Quick Start & Requirements
Build and install with pip install . (after cloning the repo and installing git-lfs and build tools; python3-dev and build-essential are required).
Convert model weights with python convert.py --quant <quant_type> <model_dir>
Run inference with ./build/main <model_weights_dir> -i "prompt"
Setting the OMP_NUM_THREADS environment variable is crucial for optimal throughput (see the sketch after these steps).
Run ./build/main -h for the full list of options.
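As a rough illustration of why OMP_NUM_THREADS matters, the sketch below (not code from this repo) shows the kind of OpenMP-parallelized matrix-vector loop that dominates CPU decoding; the runtime's thread count, reported by omp_get_max_threads(), is what that variable controls.

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>

// Illustrative matvec kernel: the per-row loop is split across the threads
// the OpenMP runtime allocates, which OMP_NUM_THREADS controls.
void matvec(const std::vector<float>& w, const std::vector<float>& x,
            std::vector<float>& y, int rows, int cols) {
    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c)
            acc += w[r * cols + c] * x[c];
        y[r] = acc;
    }
}

int main() {
    // Prints the thread count the runtime will use, e.g. 8 when the program
    // is launched with OMP_NUM_THREADS=8.
    std::printf("OpenMP max threads: %d\n", omp_get_max_threads());
    return 0;
}
```

Compile with -fopenmp (GCC/Clang) for the pragma to take effect.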
Highlighted Details
Custom quantization formats (f8e5m2, q2_k) chosen for accuracy and efficiency.
Maintenance & Community
This is a personal side project for learning and experimentation. Contributions (PRs) are welcome.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Only decoding (incremental generation) is implemented; prefill operations and optimizations like speculative decoding are missing. Some DeepSeek V3 architectural features are not yet implemented, potentially impacting accuracy. Models may exhibit repetitive behavior at low temperatures.
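To see why low temperatures encourage repetition, consider a generic temperature-scaled sampler (a sketch, not this project's sampler): dividing logits by a small temperature sharpens the softmax toward argmax, so the same high-probability continuation keeps getting picked.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Generic temperature sampling sketch. As temperature approaches 0 the
// distribution collapses onto the single highest logit, making generation
// near-greedy and prone to looping on the same tokens.
int sample_token(const std::vector<float>& logits, float temperature, std::mt19937& rng) {
    float t = std::max(temperature, 1e-6f);             // avoid divide-by-zero
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> weights(logits.size());
    for (size_t i = 0; i < logits.size(); ++i)
        weights[i] = std::exp((logits[i] - max_logit) / t);  // softmax numerator
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);                                    // normalization handled by dist
}
```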