MTPLX  by youssofal

Native MTP speculative decoding for accelerated LLM inference

Created 3 weeks ago

New!

617 stars

Top 53.0% on SourcePulse

GitHubView on GitHub
Project Summary

MTPLX addresses the challenge of high-latency LLM inference, particularly with speculative decoding methods that fail at non-zero temperatures. It offers a native MTP speculative decoding solution specifically for Apple Silicon, enabling significant speedups without sacrificing output quality. This project benefits engineers and researchers seeking to maximize LLM performance on macOS hardware.

How It Works

The core innovation is native MTP speculative decoding, leveraging the target model's own MTP heads as a speculative drafter, thus avoiding the RAM overhead of a second model. MTPLX employs mathematically precise probability-ratio acceptance with residual correction, ensuring accurate sampling even at temperatures above zero, a key differentiator from greedy-argmax approaches. This MLX-native runtime is built for Apple Silicon and integrates a full OpenAI/Anthropic-compatible serving stack.

Quick Start & Requirements

Installation is straightforward via Homebrew (brew install youssofal/mtplx/mtplx) or pip (python3 -m pip install -U mtplx). The project requires macOS with Apple Silicon and Python 3.11+. An interactive wizard (mtplx start) guides users through model selection, runtime configuration, and launching the serving surface, simplifying setup.

Highlighted Details

  • Achieves up to ~2.24x decode throughput increase on models like Qwen 3.6 27B at temp=0.6.
  • Provides a native MTP speculative decoding engine with no external drafter, minimizing RAM usage.
  • Offers a full OpenAI/Anthropic-compatible serving API for seamless integration with existing tools and UIs.
  • Supports agent tool calls and features an in-browser chat UI with live performance metrics and MTP toggling.
  • Includes advanced features like local model discovery, crash-safe fan control (with optional ThermalForge), and detailed compatibility checks.

Maintenance & Community

The project is primarily developed by Youssof Altoukhi. Contributions, bug reports, and benchmark replications are welcomed via GitHub Issues. Specific community channels like Discord or Slack are not detailed in the README.

Licensing & Compatibility

MTPLX is released under the permissive Apache License 2.0, allowing commercial use, modification, and distribution with attribution. It is strictly limited to Apple Silicon (macOS) and does not support Linux/CUDA, with no roadmap for such expansion.

Limitations & Caveats

The project is exclusively for Apple Silicon hardware and requires macOS 14.0+. Linux/CUDA users should consider alternatives like vLLM. Certain advanced features, such as sustained-no-fan decode decay, are noted as future runtime tracks, and the "Burst" mode is not recommended for long contexts.

Health Check
Last Commit

4 days ago

Responsiveness

Inactive

Pull Requests (30d)
42
Issues (30d)
43
Star History
621 stars in the last 25 days

Explore Similar Projects

Starred by Balaji Srinivasan Balaji Srinivasan(Founder of The Network School; Author of "The Network State"; Former CTO of Coinbase; Cofounder of Counsyl), Jeff Hammerbacher Jeff Hammerbacher(Cofounder of Cloudera), and
13 more.

ds4 by antirez

8.4%
12k
Fast local inference for DeepSeek V4 Flash models
Created 3 weeks ago
Updated 1 day ago
Feedback? Help us improve.