LLMSpeculativeSampling by feifeibear

Speculative sampling for fast LLM inference

Created 2 years ago
821 stars

Top 43.2% on SourcePulse

View on GitHub
Project Summary

This repository implements speculative sampling for fast Large Language Model (LLM) inference, targeting researchers and engineers who want to accelerate decoding. It speeds up generation by using a smaller "approximation" model to draft tokens that a larger "target" model then verifies and corrects, cutting the number of sequential forward passes the target model has to run.

How It Works

The core approach is a two-model decoding strategy. A smaller, faster approximation model drafts a short sequence of candidate tokens. The larger, more accurate target model then scores all of the candidates in a single parallel forward pass, accepting each drafted token with a probability derived from the ratio of the two models' distributions and resampling from the first rejected position. Because several tokens can be accepted per target-model pass, wall-clock decoding time drops compared to generating every token sequentially with the target model alone. The implementation includes variations based on Google's and DeepMind's independent proposals for speculative sampling.
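
To make the draft-then-verify loop concrete, below is a minimal sketch of one speculative-sampling step in the spirit of the Google (Leviathan et al.) and DeepMind (Chen et al.) formulations. It assumes Hugging Face-style causal LMs; the function and variable names are illustrative rather than the repository's actual API.

    import torch

    @torch.no_grad()
    def speculative_step(target_model, approx_model, input_ids, gamma=4):
        # 1) Draft: the approximation model proposes `gamma` tokens autoregressively.
        draft_ids, draft_dists = input_ids, []
        for _ in range(gamma):
            q = torch.softmax(approx_model(draft_ids).logits[:, -1, :], dim=-1)
            draft_dists.append(q)
            draft_ids = torch.cat([draft_ids, torch.multinomial(q, 1)], dim=-1)

        # 2) Verify: the target model scores every drafted position in one forward pass.
        p_all = torch.softmax(target_model(draft_ids).logits, dim=-1)

        n_prefix, accepted = input_ids.shape[1], input_ids
        for i in range(gamma):
            tok = draft_ids[:, n_prefix + i]        # i-th drafted token (assumes batch size 1)
            p = p_all[:, n_prefix + i - 1, :]       # target distribution at that position
            q = draft_dists[i]                      # approximation distribution there
            # Accept the draft with probability min(1, p(x) / q(x)).
            if torch.rand(1).item() <= min(1.0, (p[0, tok[0]] / q[0, tok[0]]).item()):
                accepted = torch.cat([accepted, tok.unsqueeze(-1)], dim=-1)
            else:
                # On rejection, resample from the residual norm(max(0, p - q)) and stop.
                residual = torch.clamp(p - q, min=0.0)
                residual = residual / residual.sum(dim=-1, keepdim=True)
                return torch.cat([accepted, torch.multinomial(residual, 1)], dim=-1)
        # All drafts accepted: sample one extra token from the target's final position.
        return torch.cat([accepted, torch.multinomial(p_all[:, -1, :], 1)], dim=-1)

Repeating this step until an end-of-sequence token appears yields samples from the same distribution as decoding with the target model alone, which is the correctness guarantee both papers prove.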

Quick Start & Requirements

  • Install via pip.
  • Requires two compatible LLMs (same vocabulary, with the approximation model smaller than the target). Tested pairs include bloom-560m (approximation) with bloomz-7b1 (target), and llama2-7B (approximation) with llama2-70B (target).
  • Inference example: python main.py --input "..." --target_model_name ... --approx_model_name ... (a fuller invocation is sketched after this list)
  • Serving example: python serving.py
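
For concreteness, an inference invocation along the lines of the bullets above might look like the following; the flags are the ones from the bullet above, while the prompt text and the specific Hugging Face model identifiers are illustrative assumptions:

    python main.py \
        --input "Suggest a good science fiction movie." \
        --approx_model_name bigscience/bloom-560m \
        --target_model_name bigscience/bloomz-7b1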

Highlighted Details

  • Implements both Google's and DeepMind's speculative sampling algorithms.
  • Includes KV Cache Optimization for the Google version (see the rollback sketch after this list).
  • Supports serving features and models like Llama and Bloom.
  • Demonstrates speedup with llama2-7B (approx) and llama2-70B (target).
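
The KV cache optimization mentioned above hinges on rolling the cache back whenever a draft is rejected: cached keys and values for the rejected suffix must be discarded so the next iteration resumes from the last accepted position. A minimal sketch of that idea, assuming the Hugging Face legacy cache layout (a per-layer tuple of key/value tensors shaped (batch, heads, seq_len, head_dim)) rather than the repository's exact code:

    def rollback_kv_cache(past_key_values, n_keep):
        # Drop cached keys/values beyond the accepted prefix in every layer.
        return tuple(
            (k[:, :, :n_keep, :], v[:, :, :n_keep, :])
            for k, v in past_key_values
        )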

Maintenance & Community

  • Initial release in August 2023, with updates in September 2023 adding serving features and model support.
  • Open to contributions for performance improvements.

Licensing & Compatibility

  • No license is explicitly stated in the README.

Limitations & Caveats

Currently supports only batch size 1 and lacks batching and parallelism optimizations that matter for real-world serving efficiency. The author also notes potential overhead from Softmax and LayerNorm operations.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

0.2%
462
MoE model for research
Created 4 months ago
Updated 4 weeks ago
Starred by Edward Sun (Research Scientist at Meta Superintelligence Lab), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 4 more.

batch_invariant_ops by thinking-machines-lab

65.1%
636
Enhance LLM inference determinism
Created 1 week ago
Updated 1 week ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Ying Sheng (Coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

0.2%
1k
Parallel decoding algorithm for faster LLM inference
Created 1 year ago
Updated 6 months ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

10.6%
2k
Speculative decoding research paper for faster LLM inference
Created 1 year ago
Updated 1 week ago