dflash-mlx by Aryagm

Fast LLM inference via speculative decoding on Apple Silicon

Created 1 week ago


311 stars

Top 86.4% on SourcePulse

Project Summary

Aryagm/dflash-mlx provides an exact speculative decoding implementation for large language models (LLMs) specifically optimized for Apple Silicon using the MLX framework. It enables significantly faster inference by reducing the number of forward passes required, making it beneficial for developers and researchers seeking to accelerate LLM deployment on Mac hardware without compromising output accuracy.

How It Works

This project ports the DFlash speculative decoding technique to MLX. DFlash trains a small "draft" model to predict multiple tokens concurrently. A larger "target" model then verifies these proposed tokens in a single forward pass, accepting the longest correct prefix. This drastically cuts down on sequential forward passes, boosting throughput. The MLX implementation is built natively on Metal, overcoming MLX's lack of built-in speculative decoding primitives. Key innovations include custom hidden-state extraction from the target model, parallel block proposal, single-pass batched verification, and precise per-layer KV cache rollback, ensuring bit-for-bit identical outputs to standard decoding.
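The propose-verify-accept loop described above can be sketched as follows. This is a hedged illustration with toy stand-ins for the draft and target models, not the actual dflash-mlx code; in the real implementation the verification is a single batched Metal forward pass, which this sketch models token by token for clarity.

```python
# Illustrative sketch of exact speculative decoding with
# longest-correct-prefix acceptance (not the dflash-mlx API).

def speculative_step(target_greedy, draft_propose, context, k):
    """One round: the draft proposes k tokens; the target verifies them
    and we accept the longest prefix matching the target's own greedy
    choices, plus one corrected (or bonus) token from the target."""
    proposed = draft_propose(context, k)
    accepted = []
    ctx = list(context)
    for tok in proposed:
        expected = target_greedy(ctx)   # target's own greedy token
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch
            return accepted             # rest of the block is discarded
        accepted.append(tok)
        ctx.append(tok)
    # Every proposed token matched; target also emits one bonus token.
    accepted.append(target_greedy(ctx))
    return accepted

# Toy "models": the target's next token is always last + 1; the draft
# agrees except it gets the third position wrong.
target = lambda ctx: ctx[-1] + 1
def draft(ctx, k):
    toks = [ctx[-1] + i + 1 for i in range(k)]
    if k >= 3:
        toks[2] = 0  # deliberate draft error
    return toks

print(speculative_step(target, draft, [1, 2], 4))  # → [3, 4, 5]
```

Because rejected tokens are always replaced by the target's own greedy choice, the generated sequence is identical to what plain decoding would produce, which is why the method is "exact": only the number of sequential target passes changes, never the output.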

Quick Start & Requirements

Installation involves cloning the repository and using uv for environment synchronization and execution:

git clone https://github.com/aryagm/dflash-mlx.git
cd dflash-mlx
uv sync
uv run dflash-mlx --max-new-tokens 128

The first run downloads the default checkpoints (Qwen3-4B BF16 and its DFlash draft), requiring approximately 12 GB of storage. MLX and Apple Silicon hardware are mandatory.

Highlighted Details

  • Achieves bit-for-bit identical output to standard decoding through exact speculative verification.
  • Native MLX implementation optimized for Apple Silicon's Metal backend.
  • Features a pluggable adapter system to easily integrate support for new LLM families.
  • Detailed benchmarks comparing performance and acceptance stats are available in the repository.
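To make the adapter bullet concrete, here is a hypothetical sketch of what a pluggable model-family adapter might look like, covering the two hooks the summary highlights (hidden-state extraction and per-layer KV cache rollback). All names and signatures here are assumptions for illustration, not the actual dflash-mlx interface.

```python
from typing import Protocol, Sequence

class ModelAdapter(Protocol):
    """Hypothetical per-family hooks a new LLM adapter would implement."""
    def extract_hidden(self, tokens: Sequence[int]) -> list:
        """Run the target model and return per-token hidden states."""
        ...
    def rollback_kv(self, keep: int) -> None:
        """Truncate every layer's KV cache to `keep` tokens after a
        speculative block is partially rejected."""
        ...

class ToyAdapter:
    """Minimal in-memory stand-in used only to show the contract."""
    def __init__(self) -> None:
        self.kv: list[int] = []          # stand-in for per-layer KV entries
    def extract_hidden(self, tokens: Sequence[int]) -> list:
        self.kv.extend(tokens)           # caching side effect of the pass
        return [t * 2 for t in tokens]   # fake hidden states
    def rollback_kv(self, keep: int) -> None:
        del self.kv[keep:]               # exact rollback: drop rejected tail
```

Under this design, supporting a new model family would mean implementing these hooks for its attention layout, while the speculative propose/verify loop stays family-agnostic.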

Maintenance & Community

The repository is maintained by Aryagm. No specific community channels (like Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The project is released under the permissive MIT license, allowing for commercial use and integration into closed-source applications without significant restrictions.

Limitations & Caveats

Support for Qwen3.5 models is functional but incomplete, and slower than Qwen3 because their more complex attention mechanisms require custom cache-rollback logic. Since MLX ships no built-in speculative decoding primitives, these had to be built from scratch.

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
2
Star History
314 stars in the last 9 days

