deepseek.cpp by andrewkchan

CPU inference for DeepSeek LLMs in C++

Created 8 months ago
313 stars

Top 86.1% on SourcePulse

Project Summary

This C++ project provides CPU-only inference for the DeepSeek family of large language models, targeting users who need efficient, hackable, and self-contained LLM execution without GPU dependencies. It offers a lean alternative to larger inference engines, enabling focused study of DeepSeek model performance on CPU.

How It Works

The implementation is based on Yet Another Language Model (YALM) and is tailored to DeepSeek architectures. It uses custom quantization schemes such as f8e5m2 (128x128 blocks, with MoE gates and layer norms kept in full precision) and q2_k (llama.cpp's 2-bit K-quantization) to reduce memory usage while preserving accuracy on CPU. The project prioritizes simplicity and hackability, with a significantly smaller codebase than mainstream inference engines.
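To make the f8e5m2 format concrete: e5m2 packs a float into 1 sign bit, 5 exponent bits, and 2 mantissa bits. Below is a minimal, hypothetical round-trip sketch that truncates an fp32 value into e5m2 and expands it back; the actual codec in deepseek.cpp (including its 128x128 block scaling) may differ in rounding and edge-case handling.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Sketch: encode fp32 -> e5m2 (1 sign, 5 exponent, 2 mantissa bits).
// Truncates the mantissa and flushes subnormals; NaN/Inf handling omitted.
uint8_t f32_to_e5m2(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint8_t sign = (bits >> 31) & 1u;
    int32_t exp  = (int32_t)((bits >> 23) & 0xFF) - 127 + 15; // rebias 127 -> 15
    uint8_t man  = (bits >> 21) & 0x3u;     // keep top 2 of 23 mantissa bits
    if (exp <= 0)  return sign << 7;        // flush zeros/subnormals (sketch)
    if (exp >= 31) return (sign << 7) | 0x7B; // clamp to max finite (sketch)
    return (sign << 7) | ((uint8_t)exp << 2) | man;
}

// Sketch: decode e5m2 -> fp32.
float e5m2_to_f32(uint8_t v) {
    int sign = (v >> 7) & 1;
    int exp  = (v >> 2) & 0x1F;
    int man  = v & 0x3;
    if (exp == 0 && man == 0) return 0.0f;  // we only encode zero here
    float m = 1.0f + man / 4.0f;            // implicit leading 1
    float val = std::ldexp(m, exp - 15);
    return sign ? -val : val;
}
```

With only 2 mantissa bits, truncation error can approach 25% of a value, which is why block-wise scales and full-precision MoE gates/layer norms matter for accuracy.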

Quick Start & Requirements

  • Install: pip install . (after cloning the repo and installing git-lfs and build tools).
  • Prerequisites: C++20-compatible compiler, Python 3.x, git-lfs, python3-dev, build-essential.
  • Model Conversion: Requires Hugging Face format safetensor weights, converted using python convert.py --quant <quant_type> <model_dir>.
  • Execution: ./build/main <model_weights_dir> -i "prompt".
  • Performance Tuning: set the OMP_NUM_THREADS environment variable; the thread count is crucial for optimal throughput.
  • Resources: DeepSeek V3 (F8E5M2) requires ~650GB RAM; Q2_K requires ~206GB RAM.
  • Docs: CLI help available via ./build/main -h.

Highlighted Details

  • CPU-only inference for DeepSeek models.
  • Custom quantization methods (f8e5m2, q2_k) for accuracy and efficiency.
  • Small codebase (<2k LOC excluding dependencies), emphasizing hackability.
  • Supports various DeepSeek model versions and quantization types (FP16, FP32, Q2_K, F8E5M2).

Maintenance & Community

This is a personal side project for learning and experimentation. Contributions (PRs) are welcome.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Only decoding (incremental token generation) is implemented; there is no separate prefill path, and optimizations such as speculative decoding are absent. Some DeepSeek V3 architectural features are not yet implemented, which may affect accuracy. Models may also exhibit repetitive behavior at low temperatures.
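The low-temperature repetition is a generic property of temperature sampling, not specific to this repo: dividing logits by a small temperature sharpens the softmax toward pure argmax decoding, which loops easily. A minimal sketch (not deepseek.cpp's actual sampler):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax with temperature (generic sketch, not the repo's sampler).
// As temp -> 0 the distribution collapses onto the largest logit,
// which is why low-temperature decoding tends toward repetitive output.
std::vector<float> softmax_temp(const std::vector<float>& logits, float temp) {
    float maxl = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp((logits[i] - maxl) / temp); // subtract max for stability
        sum += p[i];
    }
    for (float& v : p) v /= sum;
    return p;
}
```

At temp = 0.1 a 1.0-logit gap already yields a near-one-hot distribution, while large temperatures flatten it toward uniform.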

Health Check

  • Last Commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX), Eugene Yan (AI Scientist at AWS), and 2 more.

starcoder.cpp by bigcode-project

C++ example for StarCoder inference
456 stars · Created 2 years ago · Updated 2 years ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 6 more.

xTuring by stochasticai

SDK for fine-tuning and customizing open-source LLMs
3k stars · Created 2 years ago · Updated 1 day ago