dflash-mlx by bstnxbt

Speculative decoding for Apple Silicon MLX

Created 1 week ago

480 stars

Top 63.8% on SourcePulse

View on GitHub

Project Summary

This project implements DFlash speculative decoding for large language models on Apple Silicon using the MLX framework. It targets MLX users and researchers on Apple Silicon, offering speedups of up to 4.1x by drafting multiple tokens in parallel and verifying them against the target model in a single pass, so output remains lossless.

How It Works

DFlash employs a block-diffusion approach where a small draft model generates multiple tokens concurrently. A larger target model then verifies these tokens in a single forward pass. The process is "lossless" as every committed token is verified against the target model's output. Key technical innovations include "tape-replay rollback," which efficiently manages model state during verification by replaying accepted steps via a custom Metal kernel, and "JIT SDPA 2-pass" for numerically aligned long-context attention. Numerical coherence techniques stabilize bf16-sensitive paths to ensure consistency across speculative cycles.
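
The draft-then-verify cycle described above can be sketched as follows. This is a minimal illustration of the general speculative-decoding acceptance rule with toy stand-in models, not the project's actual API; real DFlash uses block diffusion and Metal-accelerated verification.

```python
def speculative_step(draft_model, target_model, context, block_size):
    """One speculative cycle: draft a block of tokens, verify it against
    the target in a single pass, commit the longest verified prefix.
    draft_model / target_model here are toy stand-ins, not real LLMs."""
    # Draft model proposes `block_size` tokens at once.
    draft = draft_model(context, block_size)
    # Target model scores the whole block in one forward pass and
    # returns the token it would emit at each position.
    target = target_model(context, draft)
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)   # verified: lossless w.r.t. the target
        else:
            accepted.append(t)   # first mismatch: take the target's token
            break
    return context + accepted

# Toy setup: the target follows a fixed sequence; the draft is correct
# only up to token 5, so the second cycle hits a mismatch.
target_seq = [1, 2, 3, 4, 5, 6, 7, 8]

def draft_model(ctx, k):
    start = len(ctx)
    return [t if t <= 5 else 0 for t in target_seq[start:start + k]]

def target_model(ctx, block):
    start = len(ctx)
    return target_seq[start:start + len(block)]

ctx = []
ctx = speculative_step(draft_model, target_model, ctx, 4)  # whole block accepted
ctx = speculative_step(draft_model, target_model, ctx, 4)  # 1 accepted + 1 corrected
```

Because every committed token either matched the target's choice or came directly from the target, the output is identical to plain target-only decoding; the speedup comes from verifying whole blocks in one forward pass.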

Quick Start & Requirements

  • Install: pip install dflash-mlx or pipx install dflash-mlx.
  • Prerequisites: Apple Silicon hardware, MLX framework. Benchmarks were run using MLX 0.31.1.
  • Links:
    • Live Demo: PYTHONPATH=. python3 -m examples.demo --mode dflash --target-model Qwen/Qwen3.5-9B --draft-model z-lab/Qwen3.5-9B-DFlash --prompt "$PROMPT" --max-tokens 2048 --no-eos
    • Serving: dflash-serve (OpenAI-compatible)
    • Benchmarking: dflash-benchmark
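
Since dflash-serve is OpenAI-compatible, a standard chat-completions request should work against it. The payload below is a sketch of that request shape; the host, port, and path are assumptions (check `dflash-serve --help` for the real defaults), and the model name is taken from the demo command above.

```python
import json

# Assumed server address: dflash-serve's actual host/port may differ.
BASE_URL = "http://localhost:8000/v1"

# Standard OpenAI chat-completions payload with SSE streaming enabled,
# as the summary says dflash-serve supports streaming SSE.
payload = {
    "model": "Qwen/Qwen3.5-9B",  # target model from the demo command
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 256,
    "stream": True,
}
body = json.dumps(payload)
```

Any OpenAI-compatible client (Open WebUI, Continue, aider) would POST this body to `{BASE_URL}/chat/completions`.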

Highlighted Details

  • Benchmarks demonstrate significant speedups, ranging from 1.35x to 4.1x across Qwen3.5 models (4B to 27B parameters) and context lengths (1024 to 4096 tokens), with acceptance rates consistently around 87-89%.
  • Provides an OpenAI-compatible server (dflash-serve) supporting streaming SSE, compatible with clients like Open WebUI, Continue, and aider.
  • Features automatic draft model resolution from HuggingFace and is optimized for Qwen3.5 models, though other models can be used with an explicit --draft flag.
  • The roadmap includes optimizations for sustained acceptance at 4096+ tokens and draft model distillation.
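
A back-of-envelope check connects the reported acceptance rates to the reported speedups. Assuming independent per-token acceptance with probability alpha and a draft block of k tokens (both simplifications; real acceptance is correlated across positions, and the block size of 8 here is an illustrative assumption, not a documented dflash-mlx setting), the expected number of tokens committed per target forward pass is a truncated geometric sum:

```python
def expected_tokens_per_pass(alpha, k):
    """Rough expected tokens committed per target-model forward pass,
    assuming each drafted token is accepted independently with
    probability alpha: 1 + alpha + alpha^2 + ... + alpha^(k-1)."""
    return sum(alpha ** j for j in range(k))

# With the ~88% acceptance reported above and an assumed block of 8,
# roughly 5.3 tokens land per target pass -- consistent with the
# reported multi-x speedups once draft-model overhead is subtracted.
est = expected_tokens_per_pass(0.88, 8)
```

With acceptance of zero the formula degrades gracefully to one token per pass, i.e. ordinary autoregressive decoding.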

Maintenance & Community

The project outlines a roadmap for future optimizations. No specific community channels (e.g., Discord, Slack) or notable contributors/sponsorships are detailed in the provided text.

Licensing & Compatibility

The project is released under the MIT license, generally permitting commercial use and modification.

Limitations & Caveats

Models without a corresponding DFlash draft on HuggingFace are rejected by default, requiring explicit specification via the --draft flag. Qwen3 models (pure attention) are supported but do not benefit from the precision enhancements of tape-replay rollback.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 5
  • Star History: 493 stars in the last 8 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

1k stars
Parallel decoding algorithm for faster LLM inference
Created 2 years ago
Updated 1 year ago
Starred by Jeff Hammerbacher (cofounder of Cloudera), Jason Knight (Director of AI Compilers at NVIDIA; cofounder of OctoML), and 1 more.

blt by facebookresearch

2k stars
Code for Byte Latent Transformer research paper
Created 1 year ago
Updated 5 months ago