dflash-mlx by bstnxbt

Speculative decoding for Apple Silicon MLX

Created 1 week ago

480 stars

Top 63.8% on SourcePulse

View on GitHub

Project Summary

This project implements DFlash speculative decoding for large language models on Apple Silicon using the MLX framework. It targets MLX users and researchers on Apple Silicon, offering speedups of up to 4.1x by drafting multiple tokens in parallel and verifying them against the target model in a single pass, so output remains lossless.

How It Works

DFlash employs a block-diffusion approach where a small draft model generates multiple tokens concurrently. A larger target model then verifies these tokens in a single forward pass. The process is "lossless" as every committed token is verified against the target model's output. Key technical innovations include "tape-replay rollback," which efficiently manages model state during verification by replaying accepted steps via a custom Metal kernel, and "JIT SDPA 2-pass" for numerically aligned long-context attention. Numerical coherence techniques stabilize bf16-sensitive paths to ensure consistency across speculative cycles.
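
The draft-then-verify cycle described above can be sketched as follows. This is a minimal illustration of the general speculative-decoding acceptance rule with toy stand-in models, not the project's actual API; real DFlash uses block diffusion and Metal-accelerated verification.

```python
def speculative_step(draft_model, target_model, context, block_size):
    """One speculative cycle: draft a block of tokens, verify it against
    the target in a single pass, commit the longest verified prefix.
    draft_model / target_model here are toy stand-ins, not real LLMs."""
    # Draft model proposes `block_size` tokens at once.
    draft = draft_model(context, block_size)
    # Target model scores the whole block in one forward pass and
    # returns the token it would emit at each position.
    target = target_model(context, draft)
    accepted = []
    for d, t in zip(draft, target):
        if d == t:
            accepted.append(d)   # verified: lossless w.r.t. the target
        else:
            accepted.append(t)   # first mismatch: take the target's token
            break
    return context + accepted

# Toy setup: the target follows a fixed sequence; the draft is correct
# only up to token 5, so the second cycle hits a mismatch.
target_seq = [1, 2, 3, 4, 5, 6, 7, 8]

def draft_model(ctx, k):
    start = len(ctx)
    return [t if t <= 5 else 0 for t in target_seq[start:start + k]]

def target_model(ctx, block):
    start = len(ctx)
    return target_seq[start:start + len(block)]

ctx = []
ctx = speculative_step(draft_model, target_model, ctx, 4)  # whole block accepted
ctx = speculative_step(draft_model, target_model, ctx, 4)  # 1 accepted + 1 corrected
```

Because every committed token either matched the target's choice or came directly from the target, the output is identical to plain target-only decoding; the speedup comes from verifying whole blocks in one forward pass.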

Quick Start & Requirements

  • Install: pip install dflash-mlx or pipx install dflash-mlx.
  • Prerequisites: Apple Silicon hardware, MLX framework. Benchmarks were run using MLX 0.31.1.
  • Links:
    • Live Demo: PYTHONPATH=. python3 -m examples.demo --mode dflash --target-model Qwen/Qwen3.5-9B --draft-model z-lab/Qwen3.5-9B-DFlash --prompt "$PROMPT" --max-tokens 2048 --no-eos
    • Serving: dflash-serve (OpenAI-compatible)
    • Benchmarking: dflash-benchmark
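
Since dflash-serve is OpenAI-compatible, a standard chat-completions request should work against it. The payload below is a sketch of that request shape; the host, port, and path are assumptions (check `dflash-serve --help` for the real defaults), and the model name is taken from the demo command above.

```python
import json

# Assumed server address: dflash-serve's actual host/port may differ.
BASE_URL = "http://localhost:8000/v1"

# Standard OpenAI chat-completions payload with SSE streaming enabled,
# as the summary says dflash-serve supports streaming SSE.
payload = {
    "model": "Qwen/Qwen3.5-9B",  # target model from the demo command
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 256,
    "stream": True,
}
body = json.dumps(payload)
```

Any OpenAI-compatible client (Open WebUI, Continue, aider) would POST this body to `{BASE_URL}/chat/completions`.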

Highlighted Details

  • Benchmarks demonstrate significant speedups, ranging from 1.35x to 4.1x across Qwen3.5 models (4B to 27B parameters) and context lengths (1024 to 4096 tokens), with acceptance rates consistently around 87-89%.
  • Provides an OpenAI-compatible server (dflash-serve) supporting streaming SSE, compatible with clients like Open WebUI, Continue, and aider.
  • Features automatic draft model resolution from HuggingFace and is optimized for Qwen3.5 models, though other models can be used with an explicit --draft flag.
  • The roadmap includes optimizations for sustained acceptance at 4096+ tokens and draft model distillation.
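
A back-of-envelope check connects the reported acceptance rates to the reported speedups. Assuming independent per-token acceptance with probability alpha and a draft block of k tokens (both simplifications; real acceptance is correlated across positions, and the block size of 8 here is an illustrative assumption, not a documented dflash-mlx setting), the expected number of tokens committed per target forward pass is a truncated geometric sum:

```python
def expected_tokens_per_pass(alpha, k):
    """Rough expected tokens committed per target-model forward pass,
    assuming each drafted token is accepted independently with
    probability alpha: 1 + alpha + alpha^2 + ... + alpha^(k-1)."""
    return sum(alpha ** j for j in range(k))

# With the ~88% acceptance reported above and an assumed block of 8,
# roughly 5.3 tokens land per target pass -- consistent with the
# reported multi-x speedups once draft-model overhead is subtracted.
est = expected_tokens_per_pass(0.88, 8)
```

With acceptance of zero the formula degrades gracefully to one token per pass, i.e. ordinary autoregressive decoding.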

Maintenance & Community

The project outlines a roadmap for future optimizations. No specific community channels (e.g., Discord, Slack) or notable contributors/sponsorships are detailed in the provided text.

Licensing & Compatibility

The project is released under the MIT license, generally permitting commercial use and modification.

Limitations & Caveats

Models without a corresponding DFlash draft on HuggingFace are rejected by default, requiring explicit specification via the --draft flag. Qwen3 models (pure attention) are supported but do not benefit from the precision enhancements of tape-replay rollback.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 5
  • Star History: 493 stars in the last 8 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

1k stars
Parallel decoding algorithm for faster LLM inference
Created 2 years ago
Updated 1 year ago
Starred by Jeff Hammerbacher (cofounder of Cloudera), Jason Knight (Director of AI Compilers at NVIDIA; cofounder of OctoML), and 1 more.

blt by facebookresearch

2k stars
Code for Byte Latent Transformer research paper
Created 1 year ago
Updated 5 months ago