speculators by vllm-project

Accelerating LLM inference with speculative decoding

Created 11 months ago
272 stars

Top 94.8% on SourcePulse

Project Summary

Speculators offers a unified library for building, training, and deploying speculative decoding algorithms within LLM inference frameworks like vLLM. It addresses the challenge of high inference latency by enabling significant speedups without sacrificing output quality. The library targets engineers and researchers seeking to optimize LLM serving, providing a standardized, end-to-end solution for creating and integrating speculative decoding models into production environments.

How It Works

Speculative decoding uses a smaller, faster "draft" model to propose multiple candidate tokens ahead of the current sequence. A larger, more capable "base" model then verifies these proposals in a single forward pass, so the expensive base model is invoked less often per generated token while output quality is preserved. Speculators standardizes this technique: it provides tools for offline data generation, end-to-end training of draft models (MoE, non-MoE, and Vision Language models), and a Hugging Face-compatible format for defining speculative models, enabling easy adoption and seamless integration with vLLM for production deployment.

Quick Start & Requirements

  • Installation:
    • PyPI (Recommended): pip install speculators
    • Source: git clone https://github.com/vllm-project/speculators.git && cd speculators && pip install -e .
    • Development: pip install -e ".[dev]"
    • Data Generation: pip install -e ".[datagen]"
  • Prerequisites:
    • Operating System: Linux or macOS
    • Python: 3.10 or higher
  • Documentation: https://docs.vllm.ai/projects/speculators/en/latest/

Highlighted Details

  • Offline Data Generation: Utilizes vLLM to generate hidden states for draft model training.
  • End-to-End Training: Supports training of single and multi-layer draft models across various architectures (MoE, non-MoE, Vision Language).
  • Standardized Format: Provides a Hugging Face-compatible format for defining speculative models, facilitating conversion from external research.
  • vLLM Integration: Designed for direct deployment into vLLM, enabling low-latency, production-grade inference.
  • Model Support: Includes training and deployment support for models like Llama, Qwen, GPT-OSS, and others, with ongoing work for Mistral models.
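For deployment, a trained or converted speculator is handed to vLLM alongside the base model. A hypothetical command-line sketch, assuming vLLM's documented `--speculative-config` JSON option (model names and values are placeholders; check the speculators and vLLM docs for the exact interface):

```shell
# Illustrative only: serve a base model with a speculative draft model.
# "path/to/speculator" stands in for a speculators-format checkpoint.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-config '{"model": "path/to/speculator", "num_speculative_tokens": 5}'
```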

Maintenance & Community

The project is part of the vLLM ecosystem, with contributions from Red Hat (e.g., models published under the RedHatAI namespace). Community discussions and support are available in the vLLM Community Slack channels #speculators and #feat-spec-decode.

Licensing & Compatibility

The library is licensed under the Apache License 2.0. This permissive license generally allows for commercial use and integration into closed-source projects without significant copyleft restrictions.

Limitations & Caveats

Some advanced model support, such as Mistral 3 Large, is marked "In Progress." The library requires Linux or macOS and Python 3.10 or higher. Performance gains depend on how well the trained draft model predicts the base model's output: a poorly matched draft yields low acceptance rates and little speedup.
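That dependence on draft quality can be made concrete with the standard speculative-decoding analysis from the literature (not specific to this library): if each draft token is accepted with probability α, a cycle of k draft tokens plus one base pass yields (1 − α^(k+1)) / (1 − α) tokens on average.

```python
def expected_tokens_per_base_pass(alpha: float, k: int) -> float:
    """Expected tokens generated per base-model forward pass, assuming each
    of k draft tokens is accepted independently with probability alpha.
    (Standard speculative-decoding analysis; not a speculators API.)"""
    if alpha == 1.0:
        return float(k + 1)
    # Closed form of the geometric series 1 + alpha + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A strong draft (80% acceptance) with 4 speculative tokens yields
# ~3.36 tokens per expensive base pass instead of 1.
print(round(expected_tokens_per_base_pass(0.8, 4), 2))  # 3.36
```

At α near 0 the expression approaches 1 (no better than plain decoding), which is why draft-model quality, not just draft-model speed, drives the end-to-end gain.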

Health Check

  • Last Commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 43
  • Issues (30d): 9
  • Star History: 42 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Luis Capelo (Cofounder of Lightning AI), and 1 more.

ArcticInference by snowflakedb

0.5%
410 stars
vLLM plugin for high-throughput, low-latency LLM and embedding inference
Created 11 months ago
Updated 1 week ago
Starred by Wing Lian (Founder of Axolotl AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 2 more.

recurrent-pretraining by seal-rg

0.1%
866 stars
Pretraining code for depth-recurrent language model research
Created 1 year ago
Updated 2 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

0.5%
2k stars
Speculative decoding research paper for faster LLM inference
Created 2 years ago
Updated 3 weeks ago