MTPLX by youssofal

Native MTP speculative decoding for accelerated LLM inference

Created 2 months ago

1,004 stars

Top 36.4% on SourcePulse

Project Summary

MTPLX addresses the challenge of high-latency LLM inference, particularly with speculative decoding methods that fail at non-zero temperatures. It offers a native MTP speculative decoding solution specifically for Apple Silicon, enabling significant speedups without sacrificing output quality. This project benefits engineers and researchers seeking to maximize LLM performance on macOS hardware.

How It Works

The core innovation is native MTP speculative decoding, leveraging the target model's own MTP heads as a speculative drafter, thus avoiding the RAM overhead of a second model. MTPLX employs mathematically precise probability-ratio acceptance with residual correction, ensuring accurate sampling even at temperatures above zero, a key differentiator from greedy-argmax approaches. This MLX-native runtime is built for Apple Silicon and integrates a full OpenAI/Anthropic-compatible serving stack.

Quick Start & Requirements

Installation is straightforward via Homebrew (brew install youssofal/mtplx/mtplx) or pip (python3 -m pip install -U mtplx). The project requires macOS with Apple Silicon and Python 3.11+. An interactive wizard (mtplx start) guides users through model selection, runtime configuration, and launching the serving surface, simplifying setup.

Highlighted Details

Achieves up to ~2.24x decode throughput increase on models like Qwen 3.6 27B at temp=0.6.
Provides a native MTP speculative decoding engine with no external drafter, minimizing RAM usage.
Offers a full OpenAI/Anthropic-compatible serving API for seamless integration with existing tools and UIs.
Supports agent tool calls and features an in-browser chat UI with live performance metrics and MTP toggling.
Includes advanced features like local model discovery, crash-safe fan control (with optional ThermalForge), and detailed compatibility checks.

Maintenance & Community

The project is primarily developed by Youssof Altoukhi. Contributions, bug reports, and benchmark replications are welcomed via GitHub Issues. Specific community channels like Discord or Slack are not detailed in the README.

Licensing & Compatibility

MTPLX is released under the permissive Apache License 2.0, allowing commercial use, modification, and distribution with attribution. It is strictly limited to Apple Silicon (macOS) and does not support Linux/CUDA, with no roadmap for such expansion.

Limitations & Caveats

The project is exclusively for Apple Silicon hardware and requires macOS 14.0+. Linux/CUDA users should consider alternatives like vLLM. Certain advanced features, such as sustained-no-fan decode decay, are noted as future runtime tracks, and the "Burst" mode is not recommended for long contexts.

MTPLX by youssofal

Explore Similar Projects

vllm-swift by TheTom

ntransformer by xaskasdf

orthrus by chiennv2000

MoE-Infinity by EfficientMoE

dflash-mlx by Aryagm

SwiftLM by SharpAI

dflash-mlx by bstnxbt

ssd by tanishqkumar

colibri by JustVugg

vllm-metal by vllm-project

lucebox by Luce-Org

ds4 by antirez