LookaheadDecoding by hao-ai-lab

Parallel decoding algorithm for faster LLM inference

created 1 year ago
1,263 stars

Top 32.0% on sourcepulse

View on GitHub
Project Summary

This repository introduces Lookahead Decoding, a parallel decoding algorithm for Large Language Models (LLMs) that speeds up generation without requiring a draft model or a data store. It targets researchers and engineers who want to reduce LLM inference latency, reporting speedups of 1.5x to 2.3x.

How It Works

Lookahead Decoding builds on Jacobi decoding, which casts autoregressive generation as solving a nonlinear system so that several future tokens can be guessed in parallel. It makes this practical by caching n-grams produced along the Jacobi iteration trajectory and verifying them in later steps. The algorithm runs two branches side by side: a lookahead branch generates candidate n-grams inside a fixed window (controlled by the window size W and n-gram size N), while a verification branch selects cached n-grams that start with the last generated token and validates them with the same LLM forward pass. Both branches are packed into a single attention mask, so they execute in one pass.
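
The loop below is a toy, heavily simplified sketch of this guess-and-verify idea, not the lade implementation: the callable greedy_next stands in for a greedy LLM forward pass, the Jacobi-based lookahead branch is reduced to caching n-grams seen in the generated text, and verification (which the real algorithm does in parallel inside one forward pass) is done sequentially for readability. All names here are illustrative.

    # Toy sketch of the guess-and-verify loop behind Lookahead Decoding (not lade code).
    # greedy_next(tokens) -> next token stands in for a greedy LLM forward pass.
    from collections import defaultdict

    def toy_lookahead_decode(greedy_next, prompt, max_new_tokens, N=4, G=4):
        tokens = list(prompt)
        pool = defaultdict(list)   # token -> cached (N-1)-token continuations

        def cache_ngrams(seq):
            # Stand-in for the lookahead branch: cache n-grams from already generated text.
            for i in range(len(seq) - N + 1):
                cont = tuple(seq[i + 1 : i + N])
                if cont not in pool[seq[i]]:
                    pool[seq[i]].append(cont)

        produced = 0
        while produced < max_new_tokens:
            # Verification branch: test up to G cached continuations of the last token
            # and keep the longest prefix the model agrees with (so output stays greedy).
            best = []
            for cont in pool[tokens[-1]][:G]:
                ctx, ok = list(tokens), []
                for tok in cont:
                    if greedy_next(ctx) != tok:
                        break
                    ok.append(tok)
                    ctx.append(tok)
                best = max(best, ok, key=len)
            if not best:               # nothing verified: ordinary greedy step
                best = [greedy_next(tokens)]
            step = best[: max_new_tokens - produced]
            tokens += step
            produced += len(step)
            cache_ngrams(tokens)
        return tokens

    # Example with a dummy "model" that cycles through three tokens:
    print(toy_lookahead_decode(lambda ctx: (ctx[-1] % 3) + 1, [1, 2, 3], 9))

The speedup in the real algorithm comes from verifying several candidate n-grams in a single batched forward pass; whenever a multi-token n-gram is accepted, multiple tokens are emitted for the cost of one sequential step.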

Quick Start & Requirements

  • Install via pip: pip install lade
  • Install from source: git clone https://github.com/hao-ai-lab/LookaheadDecoding.git && cd LookaheadDecoding && pip install -r requirements.txt && pip install -e .
  • Dependencies: Python, PyTorch. FlashAttention v2.3.3 is recommended for optimal performance.
  • Demo: USE_LADE=1 LOAD_LADE=1 python minimal.py
  • Docs: Paper, Blog

Highlighted Details

  • Achieves 1.5x-2.3x latency reduction on various LLMs and datasets.
  • Eliminates sequential dependency without draft models or data stores.
  • Supports FlashAttention for further performance gains.
  • Integrates into existing code with minimal changes (3 LoCs); see the sketch after this list.
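
As a rough illustration of that integration, the snippet below follows the usage shown in the project README; the function names and configuration keys (lade.augment_all, lade.config_lade, LEVEL, WINDOW_SIZE, GUESS_SET_SIZE) and the parameter values are indicative and should be checked against the installed version.

    # Minimal-change integration sketch, assuming the lade API documented upstream
    # (verify the exact names and defaults against your installed version).
    import os
    os.environ["USE_LADE"] = "1"     # same effect as running with USE_LADE=1 LOAD_LADE=1
    os.environ["LOAD_LADE"] = "1"

    import lade                      # import lade before loading the model, as in the demo
    lade.augment_all()               # patch supported (LLaMA) model code for lookahead decoding
    lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

    # ...load the model and tokenizer and call model.generate(...) as usual...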

Maintenance & Community

  • The accompanying paper was published at ICML 2024.
  • Core implementation is in decoding.py, with model adaptations in models/.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • Currently supports LLaMA models only.
  • FlashAttention installation may require specific CUDA/PyTorch versions.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 22 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Consistency_LLM by hao-ai-lab
397 stars (0%)
Parallel decoder for efficient LLM inference
created 1 year ago, updated 8 months ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

HALOs by ContextualAI
873 stars (0.2%)
Library for aligning LLMs using human-aware loss functions
created 1 year ago, updated 2 weeks ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding
3k stars (0.2%)
Framework for accelerating LLM generation using multiple decoding heads
created 1 year ago, updated 1 year ago