Speculative decoding research paper for faster LLM inference
EAGLE offers a novel approach to accelerating Large Language Model (LLM) inference by extrapolating contextual feature vectors, delivering significant speedups while provably preserving the output distribution of vanilla decoding. It targets researchers and engineers who want to optimize LLM deployment, enabling faster generation across a range of models and serving frameworks.
How It Works
EAGLE performs speculative decoding by extrapolating feature vectors from the second-to-top layer of the LLM: a lightweight head drafts tokens cheaply, and a verification step keeps the output consistent with vanilla decoding. EAGLE-2 enhances this by dynamically adjusting the draft tree structure based on confidence scores, and EAGLE-3 further improves speed by fusing low-, mid-, and high-level semantic features, trained via training-time testing.
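For intuition, here is a toy sketch of the generic draft-and-verify loop that speculative decoding builds on. It drafts tokens greedily with a hypothetical small `draft_model`; EAGLE itself instead extrapolates feature vectors and drafts over a tree, so this is illustrative rather than EAGLE's actual algorithm:

```python
import torch

def speculative_step(base_model, draft_model, input_ids, k=4):
    """One draft-and-verify step (toy greedy version, batch size 1).

    draft_model cheaply proposes k tokens; base_model checks them in a
    single forward pass and keeps the longest matching prefix, so the
    result is identical to decoding with base_model alone.
    """
    # 1. Draft: the small model proposes k tokens autoregressively.
    draft = input_ids
    for _ in range(k):
        logits = draft_model(draft).logits[:, -1, :]
        draft = torch.cat([draft, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. Verify: one base-model pass scores all drafted positions at once.
    base_logits = base_model(draft).logits
    n_prompt = input_ids.shape[1]
    accepted = []
    for i in range(k):
        expected = base_logits[:, n_prompt + i - 1, :].argmax(-1)
        if expected.item() == draft[0, n_prompt + i].item():
            accepted.append(expected.item())  # draft matches: keep it
        else:
            accepted.append(expected.item())  # mismatch: take the base model's token...
            break                             # ...and discard the remaining drafts
    new_tokens = torch.tensor([accepted], device=input_ids.device)
    return torch.cat([input_ids, new_tokens], dim=-1)
```

Because every drafted token is checked against the base model before being kept, the loop trades extra cheap draft passes for fewer expensive base-model passes without changing the output.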
Quick Start & Requirements
After cloning the repository, install the dependencies:
pip install -r requirements.txt
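A minimal usage sketch, assuming the EaModel wrapper and eagenerate method shown in the repository's examples; exact import paths, arguments, and checkpoint names may differ between releases, so treat this as illustrative:

```python
import torch
from eagle.model.ea_model import EaModel  # import path as shown in the EAGLE repo

# Pair a base LLM with its EAGLE draft head. Checkpoint names are
# illustrative; see the repository for the released EAGLE weights.
model = EaModel.from_pretrained(
    base_model_path="meta-llama/Llama-2-13b-chat-hf",
    ea_model_path="yuhuili/EAGLE-llama2-chat-13B",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
model.eval()

input_ids = model.tokenizer(
    "Explain speculative decoding in one paragraph.", return_tensors="pt"
).input_ids.cuda()

# eagenerate drafts with the EAGLE head and verifies against the base
# model, so the output matches what vanilla decoding would produce.
output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=256)
print(model.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```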
Highlighted Details
The README reports a 2x speedup over gpt-fast and generation 3x faster than vanilla decoding on a 13B model.
Maintenance & Community
The project has seen recent updates with the release of EAGLE-3 and support for models like Qwen-2 and LLaMA-3. It is integrated into several mainstream LLM serving frameworks including NVIDIA TensorRT-LLM and vLLM.
Licensing & Compatibility
The repository does not explicitly state a license in the README, so suitability for commercial use or closed-source linking is unspecified.
Limitations & Caveats
The README notes that for Qwen2 models, bf16 precision is recommended over fp16 to avoid numerical overflow. For non-English data with Qwen2, custom training data is suggested. Integrating a custom LLM requires modifying the model code.
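As a hedged sketch, loading Qwen2 weights in bf16 with standard Hugging Face Transformers looks like this (the model name is illustrative; EAGLE's own loader would accept an analogous torch_dtype argument):

```python
import torch
from transformers import AutoModelForCausalLM

# Load Qwen2 in bf16 rather than fp16 to avoid the numerical
# overflow the README warns about.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```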