Discover and explore top open-source AI tools and projects—updated daily.
youssofalNative MTP speculative decoding for accelerated LLM inference
New!
Top 53.0% on SourcePulse
MTPLX addresses the challenge of high-latency LLM inference, particularly with speculative decoding methods that fail at non-zero temperatures. It offers a native MTP speculative decoding solution specifically for Apple Silicon, enabling significant speedups without sacrificing output quality. This project benefits engineers and researchers seeking to maximize LLM performance on macOS hardware.
How It Works
The core innovation is native MTP speculative decoding, leveraging the target model's own MTP heads as a speculative drafter, thus avoiding the RAM overhead of a second model. MTPLX employs mathematically precise probability-ratio acceptance with residual correction, ensuring accurate sampling even at temperatures above zero, a key differentiator from greedy-argmax approaches. This MLX-native runtime is built for Apple Silicon and integrates a full OpenAI/Anthropic-compatible serving stack.
Quick Start & Requirements
Installation is straightforward via Homebrew (brew install youssofal/mtplx/mtplx) or pip (python3 -m pip install -U mtplx). The project requires macOS with Apple Silicon and Python 3.11+. An interactive wizard (mtplx start) guides users through model selection, runtime configuration, and launching the serving surface, simplifying setup.
Highlighted Details
temp=0.6.Maintenance & Community
The project is primarily developed by Youssof Altoukhi. Contributions, bug reports, and benchmark replications are welcomed via GitHub Issues. Specific community channels like Discord or Slack are not detailed in the README.
Licensing & Compatibility
MTPLX is released under the permissive Apache License 2.0, allowing commercial use, modification, and distribution with attribution. It is strictly limited to Apple Silicon (macOS) and does not support Linux/CUDA, with no roadmap for such expansion.
Limitations & Caveats
The project is exclusively for Apple Silicon hardware and requires macOS 14.0+. Linux/CUDA users should consider alternatives like vLLM. Certain advanced features, such as sustained-no-fan decode decay, are noted as future runtime tracks, and the "Burst" mode is not recommended for long contexts.
4 days ago
Inactive
antirez