ds4  by antirez

Fast local inference for DeepSeek V4 Flash models

Created 3 days ago

New!

3,005 stars

Top 15.6% on SourcePulse

GitHubView on GitHub
Project Summary

Summary ds4 is a specialized, native inference engine for DeepSeek V4 Flash, optimized for Apple's Metal GPU. It enables high-performance local LLM execution on high-end Macs, featuring long context windows and efficient on-disk KV cache persistence.

How It Works This project utilizes a DeepSeek V4 Flash-specific Metal graph executor with custom loading and KV state management. Its advantage lies in DeepSeek V4 Flash's speed, proportional thinking, and 1 million token context window. A key innovation is treating the KV cache as a "first-class disk citizen," leveraging fast SSDs for long-context persistence, with development significantly assisted by GPT 5.5.

Quick Start & Requirements

  • Installation: Build from source (make). Download specific DeepSeek V4 Flash GGUF models via ./download_model.sh (e.g., q2 for 128GB RAM, q4 for >= 256GB RAM).
  • Execution: Run CLI inference with ./ds4 or start a server with ./ds4-server.
  • Prerequisites: macOS with Metal GPU. Requires specific DeepSeek V4 Flash GGUF files; not a generic GGUF runner.
  • Hardware: Recommended for high-end Macs (128GB+ RAM).

Highlighted Details

  • Long Context: Supports a 1 million token context window.
  • Disk KV Cache: Implements on-disk KV cache persistence for efficient long-context management.
  • Optimized Quantization: Achieves good performance with 2-bit quantization on consumer hardware.
  • Agent Integration: Provides OpenAI/Anthropic-compatible server APIs for seamless integration.

Maintenance & Community The project acknowledges contributions from the llama.cpp community and notes significant AI assistance from GPT 5.5. No specific community channels or roadmap links are provided.

Licensing & Compatibility While adapted pieces use the MIT license and GGML authors' copyright is noted, the overall project license is not explicitly stated. It is strictly compatible only with the project's specially crafted DeepSeek V4 Flash GGUF files.

Limitations & Caveats This is "alpha quality code" primarily for Metal GPUs; the CPU path is unstable and for debugging only. The engine is model-specific and developed with significant AI assistance.

Health Check
Last Commit

14 hours ago

Responsiveness

Inactive

Pull Requests (30d)
18
Issues (30d)
19
Star History
3,932 stars in the last 3 days

Explore Similar Projects

Feedback? Help us improve.