Aryagm/dflash-mlx: Fast LLM inference via speculative decoding on Apple Silicon
Top 86.4% on SourcePulse
Summary
Aryagm/dflash-mlx provides an exact speculative decoding implementation for large language models (LLMs) specifically optimized for Apple Silicon using the MLX framework. It enables significantly faster inference by reducing the number of forward passes required, making it beneficial for developers and researchers seeking to accelerate LLM deployment on Mac hardware without compromising output accuracy.
How It Works
This project ports the DFlash speculative decoding technique to MLX. DFlash trains a small "draft" model to predict multiple tokens concurrently. A larger "target" model then verifies these proposed tokens in a single forward pass, accepting the longest correct prefix. This drastically cuts down on sequential forward passes, boosting throughput. The MLX implementation is built natively on Metal, overcoming MLX's lack of built-in speculative decoding primitives. Key innovations include custom hidden-state extraction from the target model, parallel block proposal, single-pass batched verification, and precise per-layer KV cache rollback, ensuring bit-for-bit identical outputs to standard decoding.
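The propose-and-verify loop described above can be sketched in plain Python. This is a hedged toy illustration of exact greedy speculative decoding, not the repository's actual code: `target_argmax` stands in for the target model's greedy next-token choice (in the real implementation all proposed positions are verified in a single batched forward pass rather than sequentially), and `verify_block` is a hypothetical helper name.

```python
def verify_block(target_argmax, context, proposal):
    """Exact greedy verification: accept the longest prefix of the draft
    proposal that matches the target model's own greedy choices, then
    append the target's token at the first mismatch (or a bonus token
    when the whole block is accepted). Toy sketch only: the real MLX
    implementation checks all positions in one batched forward pass.
    """
    accepted = []
    ctx = list(context)
    for tok in proposal:
        t = target_argmax(ctx)      # target's greedy token at this position
        if t == tok:
            accepted.append(tok)    # draft token confirmed
            ctx.append(tok)
        else:
            accepted.append(t)      # correction token from the target
            break
    else:
        # Entire block accepted: the verification pass also yields one
        # extra token "for free" from the target.
        accepted.append(target_argmax(ctx))
    return accepted
```

Because every emitted token is either confirmed or directly produced by the target model's greedy choice, the output is identical to standard decoding; the speedup comes from emitting several tokens per target forward pass when the draft guesses well.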
Quick Start & Requirements
Install by cloning the repository and letting uv synchronize the environment and run the tool: git clone https://github.com/aryagm/dflash-mlx.git && cd dflash-mlx && uv sync && uv run dflash-mlx --max-new-tokens 128. The first run downloads the default checkpoints (Qwen3-4B in BF16 plus its DFlash draft model), which require approximately 12 GB of storage. Apple Silicon hardware and the MLX framework are mandatory.
Maintenance & Community
The repository is maintained by Aryagm. No specific community channels (like Discord/Slack) or roadmap links are provided in the README.
Licensing & Compatibility
The project is released under the permissive MIT license, allowing for commercial use and integration into closed-source applications without significant restrictions.
Limitations & Caveats
Support for Qwen3.5 models is functional but incomplete, and slower than Qwen3 because their more complex attention mechanisms require custom cache rollback logic. Since MLX ships no built-in speculative decoding primitives, the project had to build them from scratch.
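The per-layer KV cache rollback mentioned above can be pictured with a minimal sketch. This is a hypothetical structure for illustration only (the repository's actual cache classes and the MLX array types they hold are not shown here): after verification rejects some speculated tokens, each layer's cache is truncated back to the accepted length so its state is bit-for-bit what plain decoding would have produced.

```python
class LayerKVCache:
    """Toy per-layer key/value cache with exact rollback (illustrative
    sketch; not the repo's real implementation)."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        # One cached key/value entry per decoded (or speculated) token.
        self.keys.append(k)
        self.values.append(v)

    def rollback(self, n_keep):
        # Discard cache entries for speculated tokens that verification
        # rejected, keeping only the first n_keep accepted positions.
        del self.keys[n_keep:]
        del self.values[n_keep:]

    def __len__(self):
        return len(self.keys)
```

Rollback must be applied to every layer's cache independently, which is why architectures with unusual attention layouts (as noted for Qwen3.5) need custom rollback logic.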