bstnxbt: Speculative decoding for Apple Silicon MLX
Top 63.8% on SourcePulse
This project implements DFlash speculative decoding for large language models on Apple Silicon using the MLX framework. It targets MLX users and researchers on Apple Silicon, offering significant speedups (up to 4.1x) by enabling models to generate multiple tokens in parallel and verify them efficiently, ensuring lossless output.
How It Works
DFlash employs a block-diffusion approach where a small draft model generates multiple tokens concurrently. A larger target model then verifies these tokens in a single forward pass. The process is "lossless" as every committed token is verified against the target model's output. Key technical innovations include "tape-replay rollback," which efficiently manages model state during verification by replaying accepted steps via a custom Metal kernel, and "JIT SDPA 2-pass" for numerically aligned long-context attention. Numerical coherence techniques stabilize bf16-sensitive paths to ensure consistency across speculative cycles.
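The draft-then-verify cycle can be sketched with toy stand-ins for the two models. This is a hypothetical illustration of lossless speculative decoding in general, not the DFlash codebase: the `target_next`/`draft_next` functions and the block size are made up, and real verification happens in a single batched forward pass rather than a Python loop.

```python
def target_next(ctx):
    # stand-in for the large target model: a deterministic next-token rule
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx):
    # stand-in for the small draft model: agrees with the target most of the
    # time, but we inject a periodic mismatch to exercise the rollback path
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % 50

def speculative_decode(prompt, n_tokens, block=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # the draft proposes a block of tokens autoregressively
        proposal, ctx = [], list(out)
        for _ in range(block):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # the target verifies the whole block; the first mismatch is replaced
        # by the target's own token and the rest of the block is discarded
        accepted, ctx = [], list(out)
        for tok in proposal:
            expect = target_next(ctx)
            if tok == expect:
                accepted.append(tok)
                ctx.append(tok)
            else:
                accepted.append(expect)
                break
        out.extend(accepted)
    return out[len(prompt):][:n_tokens]

def greedy_decode(prompt, n_tokens):
    # target-only baseline; "lossless" means speculation reproduces this exactly
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]

assert speculative_decode([1, 2, 3], 16) == greedy_decode([1, 2, 3], 16)
```

Because every committed token either matched the target or came from the target directly, the output is identical to decoding with the target alone; the speedup comes from accepting several draft tokens per target pass.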
Quick Start & Requirements
Install via `pip install dflash-mlx` or `pipx install dflash-mlx`.
Run the demo: `PYTHONPATH=. python3 -m examples.demo --mode dflash --target-model Qwen/Qwen3.5-9B --draft-model z-lab/Qwen3.5-9B-DFlash --prompt "$PROMPT" --max-tokens 2048 --no-eos`
Additional entry points: `dflash-serve` (OpenAI-compatible) and `dflash-benchmark`.
Highlighted Details
An OpenAI-compatible server (`dflash-serve`) supports streaming SSE and works with clients like Open WebUI, Continue, and aider. Draft models can be specified explicitly via `--draft` flags.
Maintenance & Community
The project outlines a roadmap for future optimizations. No specific community channels (e.g., Discord, Slack) or notable contributors/sponsorships are detailed in the provided text.
Licensing & Compatibility
The project is released under the MIT license, generally permitting commercial use and modification.
Limitations & Caveats
Models without a corresponding DFlash draft on HuggingFace are rejected by default, requiring explicit specification via the --draft flag. Qwen3 models (pure attention) are supported but do not benefit from the precision enhancements of tape-replay rollback.