LLMSpeculativeSampling by feifeibear

Speculative sampling for fast LLM inference

created 1 year ago
791 stars

Top 45.3% on sourcepulse

View on GitHub
Project Summary

This repository implements speculative sampling for fast Large Language Model (LLM) inference, targeting researchers and engineers who want to accelerate decoding. It achieves a significant speedup by using a smaller "approximation" model to draft tokens, which a larger "target" model then verifies and corrects, reducing the number of sequential forward passes the target model must run.

How It Works

The core approach is a two-model decoding strategy. A smaller, faster approximation model drafts a sequence of candidate tokens. The larger, more accurate target model then scores those candidates in parallel, accepting tokens consistent with its own distribution and resampling from the first rejected position onward. This parallel verification greatly reduces the number of sequential target-model forward passes compared to generating every token with the target model alone. The implementation includes variants based on Google's and DeepMind's independent proposals for speculative sampling.
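The accept/reject loop can be sketched in a few lines. The following is a minimal, self-contained toy and not the repository's code: the toy_model stand-in, the vocabulary size, and the choice of GAMMA = 4 draft tokens per step are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 8   # toy vocabulary size (assumption for the example)
    GAMMA = 4   # number of draft tokens proposed per step (assumption)

    def toy_model(seq, temperature):
        # Stand-in for an LLM: returns a next-token distribution given a prefix.
        # A real implementation would run the approximation/target transformer here.
        probs = np.ones(VOCAB)
        probs[seq[-1] % VOCAB] += temperature  # make it depend on the context a little
        return probs / probs.sum()

    def speculative_step(prefix, draft_model, target_model):
        # 1) The small model proposes GAMMA tokens autoregressively.
        drafted, q_probs = [], []
        seq = list(prefix)
        for _ in range(GAMMA):
            q = draft_model(seq)
            tok = rng.choice(VOCAB, p=q)
            drafted.append(tok)
            q_probs.append(q)
            seq.append(tok)

        # 2) The large model scores all GAMMA + 1 positions. In a real transformer
        #    these come from one parallel forward pass; here we call the toy per prefix.
        p_probs = [target_model(list(prefix) + drafted[:i]) for i in range(GAMMA + 1)]

        # 3) Accept each drafted token with probability min(1, p(x)/q(x)); on the first
        #    rejection, resample from the normalized residual max(p - q, 0) and stop.
        accepted = list(prefix)
        for i, tok in enumerate(drafted):
            p, q = p_probs[i], q_probs[i]
            if rng.random() < min(1.0, p[tok] / q[tok]):
                accepted.append(tok)
            else:
                residual = np.maximum(p - q, 0)
                residual /= residual.sum()
                accepted.append(rng.choice(VOCAB, p=residual))
                return accepted
        # All drafts accepted: sample one bonus token from the target distribution.
        accepted.append(rng.choice(VOCAB, p=p_probs[GAMMA]))
        return accepted

    prefix = [1, 2, 3]
    print(speculative_step(prefix, lambda s: toy_model(s, 0.5), lambda s: toy_model(s, 2.0)))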

Quick Start & Requirements

  • Install the Python dependencies via pip.
  • Requires two compatible LLMs (same embedding/vocabulary, approximation model smaller than target). Tested pairs include bloomz-7b1 (target) with bloom-560m (approximation), and llama2-7B (approximation) with llama2-70B (target).
  • Inference example (a filled-in invocation follows this list): python main.py --input "..." --target_model_name ... --approx_model_name ...
  • Serving example: python serving.py
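
For illustration, a full run with the tested Bloom pair might look like the command below; the Hugging Face model identifiers and the prompt are assumptions for this example, not commands copied from the README.

    python main.py \
        --input "The quick brown fox jumps over the lazy " \
        --approx_model_name bigscience/bloom-560m \
        --target_model_name bigscience/bloomz-7b1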

Highlighted Details

  • Implements both Google's and DeepMind's speculative sampling algorithms.
  • Includes KV cache optimization for the Google version (a rollback sketch follows this list).
  • Supports serving features and models like Llama and Bloom.
  • Demonstrates speedup with llama2-7B (approx) and llama2-70B (target).
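
The KV cache optimization matters because rejected draft tokens leave stale entries in the target model's cache, and those positions must be dropped before the next decoding step. A minimal sketch of that rollback, assuming Hugging Face-style past_key_values (per-layer key/value tensors with the sequence dimension at index 2), which may differ from the repository's actual cache layout:

    import torch

    def rollback_kv_cache(past_key_values, num_accepted):
        # Keep only cache entries for positions that survived verification;
        # tensors are assumed to have shape [batch, heads, seq_len, head_dim].
        return tuple(
            (k[:, :, :num_accepted, :], v[:, :, :num_accepted, :])
            for k, v in past_key_values
        )

Without this trimming, the next forward pass would attend to tokens that were never accepted and corrupt the output.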

Maintenance & Community

  • Initial release in August 2023, with updates in September 2023 adding serving features and model support.
  • Open to contributions for performance improvements.

Licensing & Compatibility

  • No license is explicitly stated in the README.

Limitations & Caveats

Currently supports only batch size 1 and lacks batching and parallelism optimizations that are crucial for real-world serving efficiency. The author notes potential overhead from Softmax and LayerNorm operations.

Health Check

  • Last commit: 11 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 75 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Consistency_LLM by hao-ai-lab

0%
397
Parallel decoder for efficient LLM inference
created 1 year ago
updated 8 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab

0.1%
1k
Parallel decoding algorithm for faster LLM inference
created 1 year ago
updated 5 months ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding

0.2%
3k
Framework for accelerating LLM generation using multiple decoding heads
created 1 year ago
updated 1 year ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (Author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

0.3%
9k
Tiny pretraining project for a 1.1B Llama model
created 1 year ago
updated 1 year ago