EAGLE by SafeAILab

Speculative decoding method for faster LLM inference

created 1 year ago
1,443 stars

Top 28.9% on sourcepulse

View on GitHub
Project Summary

EAGLE accelerates Large Language Model (LLM) inference by extrapolating contextual feature vectors for speculative decoding, delivering significant speedups while provably preserving the output distribution of vanilla decoding. It targets researchers and engineers who want to optimize LLM deployment, enabling faster generation across a range of models and serving frameworks.

How It Works

EAGLE performs speculative decoding by extrapolating feature vectors from the second-to-top layer of the LLM: a lightweight draft model proposes several tokens ahead, and the target model verifies them in parallel, so generation is faster while the accepted output stays consistent with vanilla decoding. EAGLE-2 improves acceptance rates by dynamically adjusting the draft tree structure based on the draft model's confidence scores, and EAGLE-3 further improves speed and performance by fusing low-, mid-, and high-level semantic features, trained with a training-time test procedure.
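As a rough mental model, the draft-and-verify loop at the heart of speculative decoding can be sketched as follows. This is a generic toy illustration (greedy acceptance, randomly initialized stand-in models), not EAGLE's actual code: EAGLE's draft head runs on extrapolated second-to-top-layer features, and EAGLE-2/3 verify a tree of drafts rather than a single chain.

# Toy sketch of speculative decoding's draft-and-verify loop (greedy acceptance).
# Illustrative only: module shapes and names are made up, and a real implementation
# verifies all K draft tokens in one batched target forward pass, which is where
# the speedup comes from.
import torch

torch.manual_seed(0)
VOCAB, DIM, K = 100, 32, 4  # toy vocabulary size, hidden size, draft length

target = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))
draft = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))

def greedy_next(model, ids):
    # Next-token prediction from the last position (stand-in for an LLM forward pass).
    return model(torch.tensor(ids))[-1].argmax().item()

ids = [1, 2, 3]                        # prompt token ids
for _ in range(8):                     # a few draft-verify rounds
    # 1) The cheap draft model proposes K tokens ahead.
    ctx = ids[:]
    proposal = []
    for _ in range(K):
        t = greedy_next(draft, ctx)
        proposal.append(t)
        ctx.append(t)
    # 2) The target model verifies the proposals: accept the longest matching
    #    prefix, then emit one corrected (or bonus) token, so the final output
    #    is identical to vanilla greedy decoding with the target model.
    ctx = ids[:]
    for t in proposal:
        expected = greedy_next(target, ctx)
        if expected != t:
            ctx.append(expected)       # rejection: take the target's token and stop
            break
        ctx.append(t)                  # accepted draft token
    else:
        ctx.append(greedy_next(target, ctx))  # all drafts accepted: one bonus token
    ids = ctx
print(ids)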

Quick Start & Requirements

  • Install by cloning the repository and running pip install -r requirements.txt.
  • Requires Python, PyTorch, and Hugging Face Transformers. Pretrained EAGLE draft weights for supported base models are available on Hugging Face; a minimal loading sketch follows this list.
  • Training and inference can be performed on 8x RTX 3090 GPUs.
  • Official documentation and integration details are available via the linked papers and blog posts.
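
A rough usage sketch is shown below. The EaModel class and eagenerate method mirror the example in the upstream README, but the model ids here are placeholders and exact arguments may differ between EAGLE releases, so treat this as an assumption-laden outline rather than a verified recipe (in particular, chat models need their chat template applied to the prompt).

# Hedged sketch: load a base LLM plus EAGLE draft weights and generate.
# Class/method names (EaModel, eagenerate) follow the upstream README; verify
# against the current repository before relying on them. Model ids are placeholders.
import torch
from eagle.model.ea_model import EaModel

model = EaModel.from_pretrained(
    base_model_path="meta-llama/Llama-2-13b-chat-hf",  # target LLM (placeholder)
    ea_model_path="yuhuili/EAGLE-llama2-chat-13B",     # EAGLE draft weights (placeholder)
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
model.eval()

prompt = "Tell me about speculative decoding."  # real chat usage should apply the chat template
input_ids = model.tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=256)
print(model.tokenizer.decode(output_ids[0]))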

Highlighted Details

  • Certified as the fastest speculative decoding method in a third-party evaluation.
  • Achieves up to 2x speedup on top of gpt-fast and runs 3x faster than vanilla decoding on a 13B model.
  • Compatible with vLLM, DeepSpeed, Mamba, FlashAttention, and quantization.
  • EAGLE-3 offers up to 5.6x speedup over vanilla decoding.

Maintenance & Community

The project has seen recent updates, including the release of EAGLE-3 and support for models such as Qwen2 and LLaMA-3. It is integrated into several mainstream LLM serving frameworks, including NVIDIA TensorRT-LLM and vLLM.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that for Qwen2 models, bf16 precision is recommended over fp16 to avoid numerical overflow, and that custom training data is suggested when using Qwen2 on non-English data. Integrating EAGLE with a custom LLM requires modifying the model code.
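
The bf16 recommendation comes down to the torch_dtype used when the Qwen2 weights are loaded. A minimal sketch with plain Hugging Face Transformers is shown below; the model id is a placeholder, and the same dtype choice applies when the model is loaded through EAGLE.

# Load a Qwen2 model in bfloat16 rather than float16 to avoid the numerical
# overflow the README warns about (plain Transformers shown for illustration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-7B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,  # bf16 recommended over fp16 for Qwen2
    device_map="auto",
)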

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 4
  • Issues (30d): 18

Star History
247 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Consistency_LLM by hao-ai-lab
0% · 397 stars · created 1 year ago · updated 8 months ago
Parallel decoder for efficient LLM inference

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab
0.1% · 1k stars · created 1 year ago · updated 4 months ago
Parallel decoding algorithm for faster LLM inference

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding
0.2% · 3k stars · created 1 year ago · updated 1 year ago
Framework for accelerating LLM generation using multiple decoding heads

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 10 more.

qlora by artidoro
0.2% · 11k stars · created 2 years ago · updated 1 year ago
Finetuning tool for quantized LLMs