Speculative sampling for fast LLM inference
This repository implements speculative sampling for fast Large Language Model (LLM) inference, targeting researchers and engineers who want to accelerate decoding. It offers a significant speedup by using a smaller "approximation" model to generate draft tokens, which are then verified and corrected by a larger "target" model, reducing the number of sequential forward passes the target model must perform.
How It Works
The core approach involves a two-model decoding strategy. A smaller, faster approximation model generates a sequence of candidate tokens. The larger, more accurate target model then processes these candidate tokens in parallel, accepting correct tokens and only re-generating from the point of divergence. This parallel verification significantly reduces the computational load compared to sequential token generation by the target model alone. The implementation includes variations based on Google's and DeepMind's independent proposals for speculative sampling.
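The mechanism is easiest to see in code. Below is a minimal sketch of a single speculative-sampling step written against generic HuggingFace-style causal LMs; the function name, the draft length `gamma`, and the temperature handling are illustrative assumptions, not the repository's actual API.

```python
import torch

@torch.no_grad()
def speculative_step(prefix_ids, approx_model, target_model, gamma=4, temperature=1.0):
    # Sketch of one speculative-sampling step (assumptions noted above):
    # draft `gamma` tokens with the small model, then verify them with a single
    # forward pass of the large model.
    # `prefix_ids` is a (1, seq_len) LongTensor; both models are HuggingFace-style
    # causal LMs whose output exposes `.logits` of shape (1, seq_len, vocab).

    # 1) Draft gamma candidate tokens autoregressively with the approximation model.
    draft = prefix_ids
    for _ in range(gamma):
        logits = approx_model(draft).logits[:, -1, :] / temperature
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        draft = torch.cat([draft, next_tok], dim=1)

    # 2) Score the whole draft with both models (one parallel pass each).
    q = torch.softmax(approx_model(draft).logits / temperature, dim=-1)  # drafter's probs
    p = torch.softmax(target_model(draft).logits / temperature, dim=-1)  # target's probs

    n = prefix_ids.shape[1]
    accepted = prefix_ids
    for i in range(gamma):
        tok = draft[0, n + i]          # i-th drafted token
        p_tok = p[0, n + i - 1, tok]   # target probability of that token
        q_tok = q[0, n + i - 1, tok]   # drafter probability of that token
        # Accept with probability min(1, p/q); this keeps the output distribution
        # identical to sampling from the target model alone.
        if torch.rand(1).item() < min(1.0, (p_tok / q_tok).item()):
            accepted = torch.cat([accepted, tok.view(1, 1)], dim=1)
        else:
            # Rejected: resample this position from the residual distribution
            # max(0, p - q), renormalized, and stop.
            residual = torch.clamp(p[0, n + i - 1] - q[0, n + i - 1], min=0.0)
            fix = torch.multinomial(residual / residual.sum(), num_samples=1)
            return torch.cat([accepted, fix.view(1, 1)], dim=1)

    # All gamma drafts accepted: take one extra "bonus" token from the target.
    bonus = torch.multinomial(p[0, -1], num_samples=1)
    return torch.cat([accepted, bonus.view(1, 1)], dim=1)
```

When most drafted tokens are accepted, the target model is invoked roughly once per `gamma + 1` generated tokens instead of once per token, which is where the speedup comes from.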
Quick Start & Requirements
Dependencies are installed with pip. Example model pairings are bloomz-7b1 (target) with bloom-560m (approximation), and llama2-7B (approximation) with llama2-70B (target).

Run inference with:
python main.py --input "..." --target_model_name ... --approx_model_name ...

Launch the serving script with:
python serving.py
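For illustration only, an invocation with the first model pairing might look like the line below; the HuggingFace identifiers and the prompt are assumptions, not taken from the repository's documentation.

python main.py --input "Emily finds a mysterious letter" --target_model_name bigscience/bloomz-7b1 --approx_model_name bigscience/bloom-560m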
Highlighted Details
The highlighted example pairs llama2-7B (approx) with llama2-70B (target).
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The implementation currently supports only batch size 1 and lacks optimizations such as batching and parallelism, which are important for real-world efficiency. The author notes potential overhead from Softmax and LayerNorm operations.