llama.cpp-deepseek-v4-flash by antirez

Experimental LLM inference engine for efficient local deployment

Created 4 weeks ago

276 stars

Top 93.8% on SourcePulse

Project Summary

Summary

This repository is an experimental fork of llama.cpp that adds support for DeepSeek v4 Flash. It targets users who want to run the model on consumer hardware, particularly MacBooks with 128GB of RAM, by applying aggressive 2-bit quantization to the routed experts. The stated benefit is chat quality with "frontier-model vibes" at a significantly reduced memory footprint.

How It Works

The fork adapts the llama.cpp inference engine to support DeepSeek v4 Flash and produces GGUF model files for memory-constrained machines. It applies 2-bit quantization specifically to the model's routed experts, which drastically cuts memory usage. Both the CPU and Metal backends are supported, with Metal offering faster inference on compatible Apple hardware.
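
To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how quantizing only the routed experts changes the weight footprint. The parameter counts and the 8-bit width for the remaining tensors are hypothetical values chosen for illustration; they are not taken from the README or from the DeepSeek v4 Flash model. The function names are also illustrative and are not part of llama.cpp.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage, in bytes, for num_params weights at a given bit width."""
    return num_params * bits_per_weight / 8


def model_gib(expert_params: int, other_params: int,
              expert_bits: float, other_bits: float) -> float:
    """Approximate total weight size in GiB for a two-group split of parameters."""
    total = (weight_bytes(expert_params, expert_bits)
             + weight_bytes(other_params, other_bits))
    return total / 2**30


# Hypothetical parameter counts, for illustration only (not the real model's).
EXPERT_PARAMS = 200_000_000_000   # weights in routed experts
OTHER_PARAMS = 10_000_000_000     # attention, shared layers, embeddings, ...

baseline = model_gib(EXPERT_PARAMS, OTHER_PARAMS, expert_bits=8, other_bits=8)
mixed = model_gib(EXPERT_PARAMS, OTHER_PARAMS, expert_bits=2, other_bits=8)

print(f"8-bit everywhere:          {baseline:.1f} GiB")
print(f"2-bit experts, 8-bit rest: {mixed:.1f} GiB")
```

Because routed experts hold most of the parameters in a mixture-of-experts model, shrinking only that group accounts for most of the saving. Real 2-bit formats also store per-block scales, so their effective bits per weight are somewhat above 2, and runtime memory includes the KV cache and activations on top of the weights.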

Quick Start & Requirements

Users can download pre-quantized GGUF models from https://huggingface.co/antirez/deepseek-v4-gguf. Inference is started with the llama-cli tool using the command llama-cli -m <model_file>. The published quantized model targets machines with 128GB of RAM. For broader installation options and detailed build guides, refer to the main llama.cpp project documentation.
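
For scripted use, the command above can be assembled and launched from Python. This is a minimal sketch, assuming llama-cli is on the PATH and that the model has already been downloaded; the filename model.gguf and the helper build_command are hypothetical placeholders, not names from the repository.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def build_command(model_path: str, binary: str = "llama-cli") -> list[str]:
    """Return the argument list for the invocation: llama-cli -m <model_file>."""
    return [binary, "-m", model_path]


MODEL = "model.gguf"  # hypothetical filename; substitute the downloaded GGUF file

cmd = build_command(MODEL)
print(shlex.join(cmd))  # show the exact command line that would be run

# Launch only if both the binary and the model file are actually present.
if shutil.which(cmd[0]) and Path(MODEL).is_file():
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the model path contains spaces.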

Highlighted Details

  • Experimental DeepSeek v4 Flash support integrated into llama.cpp.
  • GGUF files sized for MacBooks with 128GB of RAM via 2-bit quantization of routed experts.
  • Reported to offer excellent chat performance ("frontier-model vibes").
  • Backend support includes CPU and accelerated Metal (for Apple Silicon).

Maintenance & Community

This is an experimental fork. While it benefits from the robust community and development of the parent llama.cpp project, specific maintenance details or dedicated community channels for this fork are not detailed in the README.

Licensing & Compatibility

This project is a fork of llama.cpp, which is distributed under the MIT License. The listed dependencies also use permissive licenses (MIT, Public Domain). It is therefore likely compatible with commercial use, though explicit confirmation for this specific fork is recommended.

Limitations & Caveats

This implementation is explicitly labeled as experimental, and the quantized model has not undergone extensive testing. The code was developed with significant assistance from AI models (GPT 5.5) and uses the official DeepSeek v4 Flash implementation as a reference.

Health Check

Last Commit: 3 weeks ago
Responsiveness: Inactive
Pull Requests (30d): 5
Issues (30d): 6

Star History

277 stars in the last 28 days
