antirez
Experimental LLM inference engine for efficient local deployment
Summary
This repository provides an experimental fork of llama.cpp implementing DeepSeek v4 Flash. It targets users seeking to run advanced LLMs on consumer hardware, particularly MacBooks with 128GB RAM, by leveraging aggressive 2-bit quantization for routed experts. The primary benefit is enabling "frontier-model vibes" chat performance with significantly reduced memory requirements.
How It Works
The fork adapts the llama.cpp inference engine to support DeepSeek v4 Flash and produces GGUF model files sized for memory-constrained machines. It applies 2-bit quantization only to the model's routed experts, which sharply reduces memory usage. Both the CPU and Metal backends are supported, and Metal is the faster of the two on compatible Apple hardware.
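To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how the bit width of the routed experts affects the weight footprint. The parameter counts below are hypothetical placeholders, not the real DeepSeek v4 Flash figures, and the helper names are made up for illustration. Real 2-bit formats also store per-block scales, so the effective bits per weight are somewhat above 2.

```python
GIB = 2**30  # bytes per gibibyte


def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store num_params weights at an average bit width."""
    return num_params * bits_per_weight / 8


def estimate_weights_gib(
    expert_params: int,
    other_params: int,
    expert_bits: float,
    other_bits: float,
) -> float:
    """Approximate weight memory in GiB when routed experts and the
    remaining tensors are stored at different bit widths."""
    total = weight_bytes(expert_params, expert_bits)
    total += weight_bytes(other_params, other_bits)
    return total / GIB


if __name__ == "__main__":
    # HYPOTHETICAL sizes, chosen only to show the shape of the calculation.
    expert_params = 200 * 10**9  # weights held in routed experts
    other_params = 10 * 10**9    # attention, shared layers, embeddings

    for label, expert_bits in [("16-bit experts", 16.0), ("2-bit experts", 2.0)]:
        gib = estimate_weights_gib(expert_params, other_params, expert_bits, 8.0)
        print(f"{label}: ~{gib:.1f} GiB of weights")
```

The estimate covers weights only. Context (KV cache) and runtime buffers add to the total, so a real memory budget needs headroom beyond this number.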
Quick Start & Requirements
Pre-quantized GGUF models can be downloaded from https://huggingface.co/antirez/deepseek-v4-gguf. To run inference, use the llama-cli tool: llama-cli -m <model_file>. The published quantized model targets machines with 128 GB of RAM. For other installation options and detailed build guides, refer to the main llama.cpp project documentation.
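Because the model files are large, it can be worth checking that a download is actually a GGUF file before launching inference. The sketch below reads the file header (GGUF files begin with the 4-byte magic "GGUF" followed by a 32-bit version number) and then builds the llama-cli invocation described above. The function names are hypothetical, and the code assumes a little-endian file, which is the common case.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the
    file does not start with the GGUF magic bytes."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Assumption: little-endian layout (big-endian GGUF files also exist).
    (version,) = struct.unpack("<I", header[4:8])
    return version


def build_llama_cli_command(model_path: str) -> list[str]:
    """Validate the model file, then return the argument list for
    `llama-cli -m <model_file>`."""
    read_gguf_version(model_path)  # raises ValueError on a bad file
    return ["llama-cli", "-m", model_path]
```

The returned list can be passed to subprocess.run(...) once llama-cli from this fork is built and on the PATH. Only the -m option mentioned in the text is used here.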
Highlighted Details
llama.cpp.
Maintenance & Community
This is an experimental fork. While it benefits from the robust community and development of the parent llama.cpp project, specific maintenance details or dedicated community channels for this fork are not detailed in the README.
Licensing & Compatibility
This project is a fork of llama.cpp, which is distributed under the MIT License. The listed dependencies also use permissive licenses (MIT, Public Domain). The fork is therefore likely compatible with commercial use, though it is worth confirming the license terms of this specific fork before relying on that.
Limitations & Caveats
This implementation is explicitly labeled as experimental, and the quantized model has not undergone extensive testing. The code was developed with significant assistance from AI models (GPT 5.5) and uses the official DeepSeek v4 Flash as a reference.