speculators by vllm-project

Accelerating LLM inference with speculative decoding

Created 11 months ago
272 stars

Top 94.8% on SourcePulse

Project Summary

Speculators offers a unified library for building, training, and deploying speculative decoding algorithms within LLM inference frameworks like vLLM. It addresses the challenge of high inference latency by enabling significant speedups without sacrificing output quality. The library targets engineers and researchers seeking to optimize LLM serving, providing a standardized, end-to-end solution for creating and integrating speculative decoding models into production environments.

How It Works

Speculative decoding uses a smaller, faster "draft" model to propose multiple candidate tokens ahead of the current sequence. A larger, more capable "base" model then verifies these proposals in a single forward pass, so the expensive base model is invoked less often per generated token while output quality is preserved. Speculators standardizes this technique: it provides tools for offline data generation, end-to-end training of draft models (MoE, non-MoE, and Vision Language models), and a Hugging Face-compatible format for defining speculative models, enabling easy adoption and seamless integration with vLLM for production deployment.

Quick Start & Requirements

  • Installation:
    • PyPI (Recommended): pip install speculators
    • Source: git clone https://github.com/vllm-project/speculators.git && cd speculators && pip install -e .
    • Development: pip install -e ".[dev]"
    • Data Generation: pip install -e ".[datagen]"
  • Prerequisites:
    • Operating System: Linux or macOS
    • Python: 3.10 or higher
  • Documentation: https://docs.vllm.ai/projects/speculators/en/latest/

Highlighted Details

  • Offline Data Generation: Utilizes vLLM to generate hidden states for draft model training.
  • End-to-End Training: Supports training of single and multi-layer draft models across various architectures (MoE, non-MoE, Vision Language).
  • Standardized Format: Provides a Hugging Face-compatible format for defining speculative models, facilitating conversion from external research.
  • vLLM Integration: Designed for direct deployment into vLLM, enabling low-latency, production-grade inference.
  • Model Support: Includes training and deployment support for models like Llama, Qwen, GPT-OSS, and others, with ongoing work for Mistral models.
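For deployment, a trained or converted speculator is handed to vLLM alongside the base model. A hypothetical command-line sketch, assuming vLLM's documented `--speculative-config` JSON option (model names and values are placeholders; check the speculators and vLLM docs for the exact interface):

```shell
# Illustrative only: serve a base model with a speculative draft model.
# "path/to/speculator" stands in for a speculators-format checkpoint.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-config '{"model": "path/to/speculator", "num_speculative_tokens": 5}'
```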

Maintenance & Community

The project is part of the vLLM ecosystem, with contributions from Red Hat (e.g., models published under the RedHatAI namespace). Community discussions and support are available in the vLLM Community Slack channels #speculators and #feat-spec-decode.

Licensing & Compatibility

The library is licensed under the Apache License 2.0. This permissive license generally allows for commercial use and integration into closed-source projects without significant copyleft restrictions.

Limitations & Caveats

Some advanced model support, such as Mistral 3 Large, is marked "In Progress." The library requires Linux or macOS and Python 3.10 or higher. Performance gains depend on how well the trained draft model predicts the base model's output: a poorly matched draft yields low acceptance rates and little speedup.
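That dependence on draft quality can be made concrete with the standard speculative-decoding analysis from the literature (not specific to this library): if each draft token is accepted with probability α, a cycle of k draft tokens plus one base pass yields (1 − α^(k+1)) / (1 − α) tokens on average.

```python
def expected_tokens_per_base_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per base-model forward pass, assuming each
    of k draft tokens is accepted independently with probability alpha.
    (Standard speculative-decoding analysis; not a speculators API.)"""
    if alpha == 1.0:
        return float(k + 1)
    # Closed form of the geometric series 1 + alpha + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A strong draft (80% acceptance) with 4 speculative tokens yields
# ~3.36 tokens per expensive base pass instead of 1.
print(round(expected_tokens_per_base_pass(0.8, 4), 2))  # 3.36
```

At α near 0 the expression approaches 1 (no better than plain decoding), which is why draft-model quality, not just draft-model speed, drives the end-to-end gain.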

Health Check

  • Last Commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 43
  • Issues (30d): 9
  • Star History: 42 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (Cofounder of Lightning AI), and 1 more.

ArcticInference by snowflakedb

0.5%
410 stars
vLLM plugin for high-throughput, low-latency LLM and embedding inference
Created 11 months ago
Updated 1 week ago
Starred by Wing Lian (Founder of Axolotl AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

recurrent-pretraining by seal-rg

0.1%
866 stars
Pretraining code for depth-recurrent language model research
Created 1 year ago
Updated 2 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

0.5%
2k stars
Speculative decoding research paper for faster LLM inference
Created 2 years ago
Updated 3 weeks ago