EAGLE by SafeAILab

Speculative decoding research paper for faster LLM inference

Created 1 year ago
1,790 stars

Top 24.0% on SourcePulse

View on GitHub
Project Summary

EAGLE accelerates Large Language Model (LLM) inference with a novel speculative-decoding approach that extrapolates contextual feature vectors, delivering significant speedups while provably preserving the base model's output distribution. It targets researchers and engineers optimizing LLM deployment, enabling faster generation across a range of models and serving frameworks.

How It Works

EAGLE performs speculative decoding by extrapolating feature vectors from the second-to-top layer of the LLM: a lightweight draft head proposes several tokens ahead, and the base model verifies them, so generation is faster while remaining consistent with vanilla decoding. EAGLE-2 improves on this by dynamically adjusting the draft tree structure based on draft confidence scores, and EAGLE-3 further improves speed and acceptance by fusing low-, mid-, and high-level semantic features, trained with a training-time test procedure.
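
As a rough illustration of the draft-and-verify loop this builds on (not the repository's actual API; `draft_head`, its `propose` method, and the greedy acceptance rule are hypothetical simplifications of EAGLE's feature-level drafting and tree verification):

```python
import torch

def speculative_step(base_model, draft_head, input_ids, num_draft_tokens=4):
    """One illustrative EAGLE-style step: draft a few tokens from extrapolated
    features, then verify them with a single forward pass of the base model."""
    # 1. Run the base model once; EAGLE reuses the second-to-top layer's
    #    features as the signal the draft head extrapolates.
    with torch.no_grad():
        out = base_model(input_ids, output_hidden_states=True)
    features = out.hidden_states[-2]  # second-to-top layer features

    # 2. A lightweight draft head extrapolates features autoregressively and
    #    proposes a short continuation (a chain here; EAGLE-2/3 grow a tree
    #    of candidates whose shape depends on draft confidence scores).
    draft_ids = draft_head.propose(features, input_ids, num_draft_tokens)

    # 3. Verify all drafted tokens with one base-model forward pass and keep
    #    the longest agreeing prefix (greedy acceptance shown; the actual
    #    method matches vanilla decoding's output distribution exactly).
    candidate = torch.cat([input_ids, draft_ids], dim=-1)
    with torch.no_grad():
        logits = base_model(candidate).logits
    base_choice = logits[:, input_ids.shape[-1] - 1 : -1].argmax(dim=-1)
    agree = (base_choice == draft_ids).long().cumprod(dim=-1)
    accepted = int(agree.sum())  # assumes batch size 1

    return torch.cat([input_ids, draft_ids[:, :accepted]], dim=-1), accepted
```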

Quick Start & Requirements

  • Install via pip install -r requirements.txt after cloning the repository.
  • Requires Python, PyTorch, and Hugging Face Transformers. EAGLE draft weights for supported base models are available on Hugging Face (a usage sketch follows this list).
  • Training and inference can be performed on 8x RTX 3090 GPUs.
  • Official documentation and integration details are available via provided paper and blog links.
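
The snippet below is a hedged sketch of the README-style usage pattern built around an `EaModel` wrapper; the import path, keyword arguments, and weight repository names are assumptions that may differ across EAGLE versions, so confirm them against the repository's documentation.

```python
# Hedged sketch of README-style usage; module path, class name, keyword
# arguments, and weight repos below are assumptions -- check the EAGLE docs.
import torch
from eagle.model.ea_model import EaModel  # assumed import path

base_model_path = "meta-llama/Llama-2-13b-chat-hf"    # example base model
eagle_weights_path = "yuhuili/EAGLE-llama2-chat-13B"  # example draft weights

model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=eagle_weights_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
model.eval()

prompt = "Explain speculative decoding in one sentence."
input_ids = model.tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.eagenerate(input_ids, temperature=0.0, max_new_tokens=128)
print(model.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```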

Highlighted Details

  • Certified as the fastest speculative decoding method by a third-party evaluation.
  • Achieves up to 2x speedup on top of gpt-fast and up to 3x speedup over vanilla decoding on a 13B model.
  • Compatible with vLLM, DeepSpeed, Mamba, FlashAttention, and quantization.
  • EAGLE-3 offers up to 5.6x speedup over vanilla decoding.

Maintenance & Community

The project has seen recent updates with the release of EAGLE-3 and support for models such as Qwen2 and LLaMA 3. It is integrated into several mainstream LLM serving frameworks, including NVIDIA TensorRT-LLM and vLLM.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that for Qwen2 models, bf16 precision is recommended over fp16 to avoid numerical overflow, and that custom training data is suggested when working with non-English data. Integrating a custom LLM requires modifying the model code.
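
With Hugging Face Transformers, that precision recommendation amounts to loading the Qwen2 base model in bfloat16 rather than float16; a minimal sketch (the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # illustrative; substitute your base model

# bf16 keeps Qwen2 activations in range; fp16 can overflow numerically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # rather than torch.float16
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```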

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull requests (30d): 4
  • Issues (30d): 15
  • Star history: 427 stars in the last 30 days

Starred by Cody Yu (coauthor of vLLM; MTS at OpenAI), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 2 more.

Explore Similar Projects

Consistency_LLM by hao-ai-lab · 0.3% · 404 stars
Parallel decoder for efficient LLM inference
Created 1 year ago · Updated 10 months ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab · 0.2% · 462 stars
MoE model for research
Created 4 months ago · Updated 4 weeks ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab · 0.2% · 1k stars
Parallel decoding algorithm for faster LLM inference
Created 1 year ago · Updated 6 months ago