EAGLE by SafeAILab

Speculative decoding method for faster LLM inference

created 1 year ago
1,443 stars

Top 28.9% on sourcepulse

View on GitHub
Project Summary

EAGLE accelerates Large Language Model (LLM) inference by extrapolating contextual feature vectors for speculative decoding, delivering significant speedups while provably preserving the output distribution of vanilla decoding. It targets researchers and engineers who want to optimize LLM deployment, enabling faster generation across a range of models and serving frameworks.

How It Works

EAGLE performs speculative decoding by extrapolating feature vectors from the second-to-top layer of the LLM: a lightweight draft model proposes several tokens ahead, and the target model verifies them in parallel, so generation is faster while the accepted output stays consistent with vanilla decoding. EAGLE-2 improves acceptance rates by dynamically adjusting the draft tree structure based on the draft model's confidence scores, and EAGLE-3 further improves speed and performance by fusing low-, mid-, and high-level semantic features, trained with a training-time test procedure.
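As a rough mental model, the draft-and-verify loop at the heart of speculative decoding can be sketched as follows. This is a generic toy illustration (greedy acceptance, randomly initialized stand-in models), not EAGLE's actual code: EAGLE's draft head runs on extrapolated second-to-top-layer features, and EAGLE-2/3 verify a tree of drafts rather than a single chain.

# Toy sketch of speculative decoding's draft-and-verify loop (greedy acceptance).
# Illustrative only: module shapes and names are made up, and a real implementation
# verifies all K draft tokens in one batched target forward pass, which is where
# the speedup comes from.
import torch

torch.manual_seed(0)
VOCAB, DIM, K = 100, 32, 4  # toy vocabulary size, hidden size, draft length

target = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))
draft = torch.nn.Sequential(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, VOCAB))

def greedy_next(model, ids):
    # Next-token prediction from the last position (stand-in for an LLM forward pass).
    return model(torch.tensor(ids))[-1].argmax().item()

ids = [1, 2, 3]                        # prompt token ids
for _ in range(8):                     # a few draft-verify rounds
    # 1) The cheap draft model proposes K tokens ahead.
    ctx = ids[:]
    proposal = []
    for _ in range(K):
        t = greedy_next(draft, ctx)
        proposal.append(t)
        ctx.append(t)
    # 2) The target model verifies the proposals: accept the longest matching
    #    prefix, then emit one corrected (or bonus) token, so the final output
    #    is identical to vanilla greedy decoding with the target model.
    ctx = ids[:]
    for t in proposal:
        expected = greedy_next(target, ctx)
        if expected != t:
            ctx.append(expected)       # rejection: take the target's token and stop
            break
        ctx.append(t)                  # accepted draft token
    else:
        ctx.append(greedy_next(target, ctx))  # all drafts accepted: one bonus token
    ids = ctx
print(ids)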

Quick Start & Requirements

  • Install by cloning the repository and running pip install -r requirements.txt.
  • Requires Python, PyTorch, and Hugging Face Transformers. Pretrained EAGLE draft weights for supported base models are available on Hugging Face; a minimal loading sketch follows this list.
  • Training and inference can be performed on 8x RTX 3090 GPUs.
  • Official documentation and integration details are available via the linked papers and blog posts.
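
A rough usage sketch is shown below. The EaModel class and eagenerate method mirror the example in the upstream README, but the model ids here are placeholders and exact arguments may differ between EAGLE releases, so treat this as an assumption-laden outline rather than a verified recipe (in particular, chat models need their chat template applied to the prompt).

# Hedged sketch: load a base LLM plus EAGLE draft weights and generate.
# Class/method names (EaModel, eagenerate) follow the upstream README; verify
# against the current repository before relying on them. Model ids are placeholders.
import torch
from eagle.model.ea_model import EaModel

model = EaModel.from_pretrained(
    base_model_path="meta-llama/Llama-2-13b-chat-hf",  # target LLM (placeholder)
    ea_model_path="yuhuili/EAGLE-llama2-chat-13B",     # EAGLE draft weights (placeholder)
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
model.eval()

prompt = "Tell me about speculative decoding."  # real chat usage should apply the chat template
input_ids = model.tokenizer(prompt, return_tensors="pt").input_ids.cuda()
output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=256)
print(model.tokenizer.decode(output_ids[0]))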

Highlighted Details

  • Certified as the fastest speculative decoding method in a third-party evaluation.
  • Achieves up to 2x speedup on top of gpt-fast and runs 3x faster than vanilla decoding on a 13B model.
  • Compatible with vLLM, DeepSpeed, Mamba, FlashAttention, and quantization.
  • EAGLE-3 offers up to 5.6x speedup over vanilla decoding.

Maintenance & Community

The project has seen recent updates, including the release of EAGLE-3 and support for models such as Qwen2 and LLaMA-3. It is integrated into several mainstream LLM serving frameworks, including NVIDIA TensorRT-LLM and vLLM.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README notes that for Qwen2 models, bf16 precision is recommended over fp16 to avoid numerical overflow, and that custom training data is suggested when using Qwen2 on non-English data. Integrating EAGLE with a custom LLM requires modifying the model code.
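
The bf16 recommendation comes down to the torch_dtype used when the Qwen2 weights are loaded. A minimal sketch with plain Hugging Face Transformers is shown below; the model id is a placeholder, and the same dtype choice applies when the model is loaded through EAGLE.

# Load a Qwen2 model in bfloat16 rather than float16 to avoid the numerical
# overflow the README warns about (plain Transformers shown for illustration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-7B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,  # bf16 recommended over fp16 for Qwen2
    device_map="auto",
)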

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 4
  • Issues (30d): 18

Star History
247 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Zhuohan Li (Author of vLLM), and 1 more.

Consistency_LLM by hao-ai-lab
0% · 397 stars · created 1 year ago · updated 8 months ago
Parallel decoder for efficient LLM inference

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Ying Sheng (Author of SGLang), and 1 more.

LookaheadDecoding by hao-ai-lab
0.1% · 1k stars · created 1 year ago · updated 4 months ago
Parallel decoding algorithm for faster LLM inference

Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 3 more.

Medusa by FasterDecoding
0.2% · 3k stars · created 1 year ago · updated 1 year ago
Framework for accelerating LLM generation using multiple decoding heads

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 10 more.

qlora by artidoro
0.2% · 11k stars · created 2 years ago · updated 1 year ago
Finetuning tool for quantized LLMs