LookaheadDecoding by hao-ai-lab

Parallel decoding algorithm for faster LLM inference

Created 1 year ago
1,279 stars

Top 31.1% on SourcePulse

View on GitHub
Project Summary

This repository introduces Lookahead Decoding, a parallel inference algorithm for Large Language Models (LLMs) that substantially accelerates generation without requiring a draft model or a data store. It targets researchers and engineers seeking to reduce LLM inference latency, reporting speedups of 1.5x to 2.3x.

How It Works

Lookahead Decoding builds on Jacobi decoding, which recasts autoregressive generation as solving a nonlinear system so that several future tokens can be guessed in parallel. It makes this practical by caching n-grams produced along the Jacobi iteration trajectory and verifying them against the model's own predictions. Each step runs two branches in parallel: a lookahead branch that refines guesses and collects candidate n-grams within a fixed window (window size W, n-gram size N), and a verification branch that selects cached n-grams by string matching on the latest generated token and validates them with the LLM. Both branches are fused into a single forward pass via a specially constructed attention mask.
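
Below is a minimal, self-contained sketch of this cache-and-verify idea, not the repository's implementation: a deterministic toy next-token function stands in for the LLM, the lookahead branch performs Jacobi-style updates over a window of W guesses and caches candidate N-grams, and the verification branch accepts the longest cached n-gram that agrees with greedy decoding. Names such as toy_model and lookahead_generate are illustrative, and the real algorithm batches both branches into a single forward pass and collects n-grams along the iteration trajectory rather than within one window.

    import random

    VOCAB = list(range(50))
    W, N = 5, 3  # lookahead window size and n-gram size

    def toy_model(context):
        """Deterministic stand-in for one LLM forward pass: next token given a context."""
        random.seed(hash(tuple(context[-2:])) % (2**32))
        return random.choice(VOCAB)

    def lookahead_generate(prompt, steps=10):
        seq = list(prompt)
        window = [random.choice(VOCAB) for _ in range(W)]  # Jacobi initial guesses
        ngram_pool = {}  # first token -> set of candidate n-grams

        for _ in range(steps):
            # Lookahead branch: one Jacobi iteration refreshes every window slot
            # (conceptually in parallel; the real code batches this into one
            # forward pass), then candidate n-grams are sliced out and cached.
            new_window = [toy_model(seq + window[:i]) for i in range(W)]
            for i in range(W - N + 1):
                gram = tuple(new_window[i:i + N])
                ngram_pool.setdefault(gram[0], set()).add(gram)
            window = new_window

            # Verification branch: the model's true next token is always kept;
            # cached n-grams that start with it are re-checked token by token,
            # and the longest verified one is accepted, so a single step can
            # emit several tokens while still matching greedy decoding exactly.
            next_tok = toy_model(seq)
            accepted = [next_tok]
            for gram in ngram_pool.get(next_tok, ()):
                verified = list(gram)
                for j in range(1, N):
                    if toy_model(seq + verified[:j]) != gram[j]:
                        verified = verified[:j]
                        break
                if len(verified) > len(accepted):
                    accepted = verified
            seq.extend(accepted)
        return seq

    print(lookahead_generate([1, 2, 3]))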

Quick Start & Requirements

  • Install via pip: pip install lade
  • Install from source: git clone https://github.com/hao-ai-lab/LookaheadDecoding.git && cd LookaheadDecoding && pip install -r requirements.txt && pip install -e .
  • Dependencies: Python, PyTorch. FlashAttention v2.3.3 is recommended for optimal performance.
  • Demo: USE_LADE=1 LOAD_LADE=1 python minimal.py (see the integration sketch after this list)
  • Docs: Paper, Blog
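
Per the README, enabling lookahead decoding in an existing Hugging Face generation script takes roughly three lines. The sketch below follows the README's example; the call names (lade.augment_all, lade.config_lade) and parameter values (LEVEL, WINDOW_SIZE, GUESS_SET_SIZE) should be verified against the installed lade version.

    # Sketch of the ~3-line integration described in the project README; check
    # call names and parameter values against the installed lade version.
    import os
    os.environ["USE_LADE"] = "1"   # lade also reads this flag from the environment

    import lade
    lade.augment_all()             # patch the supported (LLaMA) Hugging Face models
    lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

    # From here, model.generate(...) on a supported LLaMA model runs with
    # lookahead decoding; the rest of the generation code is unchanged.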

Highlighted Details

  • Achieves 1.5x-2.3x latency reduction on various LLMs and datasets.
  • Eliminates sequential dependency without draft models or data stores.
  • Supports FlashAttention for further performance gains.
  • Integrates into existing code with minimal changes (about three lines of code; see the sketch under Quick Start).

Maintenance & Community

  • The accompanying paper was published at ICML 2024.
  • Core implementation is in decoding.py, with model adaptations in models/.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Currently supports LLaMA models only.
  • FlashAttention installation may require specific CUDA/PyTorch versions.
Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days

Explore Similar Projects

  • Consistency_LLM by hao-ai-lab: Parallel decoder for efficient LLM inference. 404 stars; created 1 year ago, updated 10 months ago.
  • batch_invariant_ops by thinking-machines-lab: Enhance LLM inference determinism. 636 stars; created 1 week ago, updated 1 week ago.
  • EAGLE by SafeAILab: Speculative decoding research paper for faster LLM inference. 2k stars; created 1 year ago, updated 1 week ago.