Parallel decoding algorithm for faster LLM inference
Top 32.0% on sourcepulse
This repository introduces Lookahead Decoding, a parallel inference algorithm for Large Language Models (LLMs) that accelerates generation without requiring a draft model or a data store. It targets researchers and engineers who want to reduce LLM inference latency, with reported speedups of 1.5x to 2.3x.
How It Works
Lookahead Decoding builds on Jacobi Decoding, which frames autoregressive generation as solving a nonlinear system so that several future tokens can be predicted in parallel. Plain Jacobi iteration rarely produces accepted tokens on its own, so Lookahead Decoding makes it practical by caching n-grams observed along the Jacobi iteration trajectory and verifying them later. Each step runs two branches in parallel: a lookahead branch that generates new n-grams within a fixed window (controlled by the W and N parameters), and a verification branch that selects cached n-grams whose first token matches the last generated token and validates them with the LLM's forward pass. Both branches are fused into a single attention mask, so they execute in one forward pass per step.
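To make the two-branch idea concrete, below is a minimal, self-contained sketch of the verification side only. It is not the repository's implementation: the toy next_token function stands in for a greedy LLM step, the n-gram pool is pre-filled instead of being produced by the lookahead branch, and candidates are checked token by token rather than in one batched forward pass under a combined attention mask.

import random  # unused here; kept out to stay minimal
from collections import defaultdict

VOCAB = 50


def next_token(seq):
    """Toy stand-in for one greedy LLM decoding step (deterministic)."""
    return (sum(seq) * 31 + 7) % VOCAB


def verify(seq, candidate):
    """Return the longest prefix of `candidate` the toy model would emit after `seq`.

    In real Lookahead Decoding this check happens inside the same forward
    pass as the lookahead branch, via a specially constructed attention mask.
    """
    accepted, ctx = [], list(seq)
    for tok in candidate:
        if next_token(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted


def decode(prompt, steps, ngram_pool):
    """Greedy decoding with n-gram verification: accept many tokens per step when possible."""
    seq, produced = list(prompt), 0
    while produced < steps:
        best = []
        # Verification branch: try cached n-grams keyed by the last generated token.
        for cand in ngram_pool.get(seq[-1], []):
            ok = verify(seq, cand)
            if len(ok) > len(best):
                best = ok
        if not best:
            # No candidate verified: fall back to a single ordinary greedy step.
            best = [next_token(seq)]
        seq.extend(best)
        produced += len(best)
    return seq


pool = defaultdict(list)
# Pretend this n-gram was collected from a lookahead-branch (Jacobi) trajectory;
# it happens to match the toy model's greedy continuation, so all 3 tokens are
# accepted in a single step.
pool[3].append([43, 26, 32])
print(decode([1, 2, 3], steps=8, ngram_pool=pool))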
Quick Start & Requirements
Install from PyPI: pip install lade
Or install from source: git clone https://github.com/hao-ai-lab/LookaheadDecoding.git && cd LookaheadDecoding && pip install -r requirements.txt && pip install -e .
Run the minimal example: USE_LADE=1 LOAD_LADE=1 python minimal.py
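For use inside your own script, the sketch below assumes the lade.augment_all() / lade.config_lade(...) entry points and parameter names (LEVEL, WINDOW_SIZE, GUESS_SET_SIZE) described in the project's documentation, plus a Hugging Face transformers causal LM; the model checkpoint is an arbitrary example, and exact names or defaults may differ across versions, so check the repository for the current interface.

import os
os.environ["USE_LADE"] = "1"  # flag used by the repo's examples; set before loading the model

import lade
lade.augment_all()  # patch transformers' decoding loop to use Lookahead Decoding
lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)  # W/N-style knobs

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint, not prescribed by the repo
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("Explain lookahead decoding in one sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)  # generation is accelerated transparently
print(tok.decode(out[0], skip_special_tokens=True))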
Highlighted Details
Maintenance & Community
The core algorithm is implemented in decoding.py, with model adaptations in models/.
Licensing & Compatibility
Limitations & Caveats