dflash-mlx by Aryagm

Fast LLM inference via speculative decoding on Apple Silicon

Created 1 week ago


311 stars

Top 86.4% on SourcePulse

Project Summary

Aryagm/dflash-mlx provides an exact speculative decoding implementation for large language models (LLMs) specifically optimized for Apple Silicon using the MLX framework. It enables significantly faster inference by reducing the number of forward passes required, making it beneficial for developers and researchers seeking to accelerate LLM deployment on Mac hardware without compromising output accuracy.

How It Works

This project ports the DFlash speculative decoding technique to MLX. DFlash trains a small "draft" model to predict multiple tokens concurrently. A larger "target" model then verifies these proposed tokens in a single forward pass, accepting the longest correct prefix. This drastically cuts down on sequential forward passes, boosting throughput. The MLX implementation is built natively on Metal, overcoming MLX's lack of built-in speculative decoding primitives. Key innovations include custom hidden-state extraction from the target model, parallel block proposal, single-pass batched verification, and precise per-layer KV cache rollback, ensuring bit-for-bit identical outputs to standard decoding.
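The propose-verify-accept loop described above can be sketched as follows. This is a hedged illustration with toy stand-ins for the draft and target models, not the actual dflash-mlx code; in the real implementation the verification is a single batched Metal forward pass, which this sketch models token by token for clarity.

```python
# Illustrative sketch of exact speculative decoding with
# longest-correct-prefix acceptance (not the dflash-mlx API).

def speculative_step(target_greedy, draft_propose, context, k):
    """One round: the draft proposes k tokens; the target verifies them
    and we accept the longest prefix matching the target's own greedy
    choices, plus one corrected (or bonus) token from the target."""
    proposed = draft_propose(context, k)
    accepted = []
    ctx = list(context)
    for tok in proposed:
        expected = target_greedy(ctx)   # target's own greedy token
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch
            return accepted             # rest of the block is discarded
        accepted.append(tok)
        ctx.append(tok)
    # Every proposed token matched; target also emits one bonus token.
    accepted.append(target_greedy(ctx))
    return accepted

# Toy "models": the target's next token is always last + 1; the draft
# agrees except it gets the third position wrong.
target = lambda ctx: ctx[-1] + 1
def draft(ctx, k):
    toks = [ctx[-1] + i + 1 for i in range(k)]
    if k >= 3:
        toks[2] = 0  # deliberate draft error
    return toks

print(speculative_step(target, draft, [1, 2], 4))  # → [3, 4, 5]
```

Because rejected tokens are always replaced by the target's own greedy choice, the generated sequence is identical to what plain decoding would produce, which is why the method is "exact": only the number of sequential target passes changes, never the output.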

Quick Start & Requirements

Installation involves cloning the repository and using uv for environment synchronization and execution:

git clone https://github.com/aryagm/dflash-mlx.git
cd dflash-mlx
uv sync
uv run dflash-mlx --max-new-tokens 128

The first run downloads the default checkpoints (Qwen3-4B BF16 and its DFlash draft), requiring approximately 12 GB of storage. MLX and Apple Silicon hardware are mandatory.

Highlighted Details

  • Achieves bit-for-bit identical output to standard decoding through exact speculative verification.
  • Native MLX implementation optimized for Apple Silicon's Metal backend.
  • Features a pluggable adapter system to easily integrate support for new LLM families.
  • Detailed benchmarks comparing performance and acceptance stats are available in the repository.
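To make the adapter bullet concrete, here is a hypothetical sketch of what a pluggable model-family adapter might look like, covering the two hooks the summary highlights (hidden-state extraction and per-layer KV cache rollback). All names and signatures here are assumptions for illustration, not the actual dflash-mlx interface.

```python
from typing import Protocol, Sequence

class ModelAdapter(Protocol):
    """Hypothetical per-family hooks a new LLM adapter would implement."""
    def extract_hidden(self, tokens: Sequence[int]) -> list:
        """Run the target model and return per-token hidden states."""
        ...
    def rollback_kv(self, keep: int) -> None:
        """Truncate every layer's KV cache to `keep` tokens after a
        speculative block is partially rejected."""
        ...

class ToyAdapter:
    """Minimal in-memory stand-in used only to show the contract."""
    def __init__(self) -> None:
        self.kv: list[int] = []          # stand-in for per-layer KV entries
    def extract_hidden(self, tokens: Sequence[int]) -> list:
        self.kv.extend(tokens)           # caching side effect of the pass
        return [t * 2 for t in tokens]   # fake hidden states
    def rollback_kv(self, keep: int) -> None:
        del self.kv[keep:]               # exact rollback: drop rejected tail
```

Under this design, supporting a new model family would mean implementing these hooks for its attention layout, while the speculative propose/verify loop stays family-agnostic.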

Maintenance & Community

The repository is maintained by Aryagm. No specific community channels (like Discord/Slack) or roadmap links are provided in the README.

Licensing & Compatibility

The project is released under the permissive MIT license, allowing for commercial use and integration into closed-source applications without significant restrictions.

Limitations & Caveats

Support for Qwen3.5 models is functional but incomplete, and slower than Qwen3 because their more complex attention mechanisms require custom cache-rollback logic. Since MLX ships no built-in speculative decoding primitives, these had to be built from scratch.

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
3
Issues (30d)
2
Star History
314 stars in the last 9 days

