Speculative sampling for fast LLM inference
This repository implements speculative sampling for fast Large Language Model (LLM) inference, targeting researchers and engineers who want to accelerate decoding. It offers a significant speedup by using a smaller "approximation" model to generate draft tokens, which are then verified and corrected by a larger "target" model, reducing the number of sequential forward passes the target model must perform.
How It Works
The core approach involves a two-model decoding strategy. A smaller, faster approximation model generates a sequence of candidate tokens. The larger, more accurate target model then processes these candidate tokens in parallel, accepting correct tokens and only re-generating from the point of divergence. This parallel verification significantly reduces the computational load compared to sequential token generation by the target model alone. The implementation includes variations based on Google's and DeepMind's independent proposals for speculative sampling.
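The mechanism is easiest to see in code. Below is a minimal sketch of a single speculative-sampling step written against generic HuggingFace-style causal LMs; the function name, the draft length `gamma`, and the temperature handling are illustrative assumptions, not the repository's actual API.

```python
import torch

@torch.no_grad()
def speculative_step(prefix_ids, approx_model, target_model, gamma=4, temperature=1.0):
    # Sketch of one speculative-sampling step (assumptions noted above):
    # draft `gamma` tokens with the small model, then verify them with a single
    # forward pass of the large model.
    # `prefix_ids` is a (1, seq_len) LongTensor; both models are HuggingFace-style
    # causal LMs whose output exposes `.logits` of shape (1, seq_len, vocab).

    # 1) Draft gamma candidate tokens autoregressively with the approximation model.
    draft = prefix_ids
    for _ in range(gamma):
        logits = approx_model(draft).logits[:, -1, :] / temperature
        next_tok = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        draft = torch.cat([draft, next_tok], dim=1)

    # 2) Score the whole draft with both models (one parallel pass each).
    q = torch.softmax(approx_model(draft).logits / temperature, dim=-1)  # drafter's probs
    p = torch.softmax(target_model(draft).logits / temperature, dim=-1)  # target's probs

    n = prefix_ids.shape[1]
    accepted = prefix_ids
    for i in range(gamma):
        tok = draft[0, n + i]          # i-th drafted token
        p_tok = p[0, n + i - 1, tok]   # target probability of that token
        q_tok = q[0, n + i - 1, tok]   # drafter probability of that token
        # Accept with probability min(1, p/q); this keeps the output distribution
        # identical to sampling from the target model alone.
        if torch.rand(1).item() < min(1.0, (p_tok / q_tok).item()):
            accepted = torch.cat([accepted, tok.view(1, 1)], dim=1)
        else:
            # Rejected: resample this position from the residual distribution
            # max(0, p - q), renormalized, and stop.
            residual = torch.clamp(p[0, n + i - 1] - q[0, n + i - 1], min=0.0)
            fix = torch.multinomial(residual / residual.sum(), num_samples=1)
            return torch.cat([accepted, fix.view(1, 1)], dim=1)

    # All gamma drafts accepted: take one extra "bonus" token from the target.
    bonus = torch.multinomial(p[0, -1], num_samples=1)
    return torch.cat([accepted, bonus.view(1, 1)], dim=1)
```

When most drafted tokens are accepted, the target model is invoked roughly once per `gamma + 1` generated tokens instead of once per token, which is where the speedup comes from.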
Quick Start & Requirements
Dependencies are installed with pip. Example model pairings are bloomz-7b1 (target) with bloom-560m (approximation), and llama2-7B (approximation) with llama2-70B (target).

Run inference with:
python main.py --input "..." --target_model_name ... --approx_model_name ...

Launch the serving script with:
python serving.py
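For illustration only, an invocation with the first model pairing might look like the line below; the HuggingFace identifiers and the prompt are assumptions, not taken from the repository's documentation.

python main.py --input "Emily finds a mysterious letter" --target_model_name bigscience/bloomz-7b1 --approx_model_name bigscience/bloom-560m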
Highlighted Details
The highlighted example pairs llama2-7B (approx) with llama2-70B (target).
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The implementation currently supports only batch size 1 and lacks optimizations such as batching and parallelism, which are important for real-world efficiency. The author notes potential overhead from Softmax and LayerNorm operations.