Framework for accelerating LLM generation using multiple decoding heads
Medusa is a framework designed to accelerate Large Language Model (LLM) generation by employing multiple decoding heads. It targets researchers and developers seeking to improve inference speed without the complexity of traditional speculative decoding, offering a parameter-efficient way to enhance LLM performance, particularly for single-batch inference scenarios common in local hosting.
How It Works
Medusa augments existing LLMs by adding lightweight new "heads" that predict multiple future tokens concurrently. Unlike speculative decoding, it requires no separate draft model: the new heads are fine-tuned on top of the same base model, making the approach parameter-efficient. During generation, the heads propose multiple candidate continuations, which are verified in parallel via a tree-based attention mechanism. An acceptance scheme then selects the longest valid prefix among the candidates, yielding faster decoding. The Medusa-2 variant additionally allows full-model training, and self-distillation lets the method be applied to any fine-tuned LLM without access to its original training data.
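The acceptance step can be sketched as follows. This is an illustrative toy, not the Medusa API: `base_model_next` is a hypothetical stand-in for a real LLM's next-token prediction, and the candidate is checked token by token here, whereas Medusa verifies all candidates in a single forward pass using tree attention.

```python
# Illustrative sketch (not the Medusa API): accept the longest prefix of a
# head-proposed candidate that the base model would also have produced.

def base_model_next(context):
    # Toy "base model": deterministically maps a context to its next token.
    table = {(1,): 2, (1, 2): 3, (1, 2, 3): 4, (1, 2, 3, 4): 5}
    return table.get(tuple(context), 0)

def accept_longest_prefix(context, candidate):
    """Accept tokens from `candidate` while each matches the base model's
    own prediction for the growing context (Medusa does this check in one
    tree-attention pass; a loop is used here for clarity)."""
    accepted = []
    for token in candidate:
        if base_model_next(context + accepted) != token:
            break
        accepted.append(token)
    # The first mismatch is replaced by the base model's own prediction,
    # so every decoding step still emits at least one token.
    accepted.append(base_model_next(context + accepted))
    return accepted

# Heads propose [2, 3, 9]; tokens 2 and 3 verify, 9 does not, and the
# base model's correction (4) is appended: three tokens in one step.
print(accept_longest_prefix([1], [2, 3, 9]))  # -> [2, 3, 4]
```

This is why the speedup is "free" in quality terms: the accepted output is, by construction, exactly what greedy decoding of the base model would have produced.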
Quick Start & Requirements
Install from PyPI:
pip install medusa-llm
Or install from source:
git clone https://github.com/FasterDecoding/Medusa.git
cd Medusa
pip install -e .
Training additionally requires accelerate and axolotl.
Highlighted Details
Maintenance & Community
The project is actively developed, with a roadmap available. Community contributions are welcomed via GitHub issues and pull requests. Support is provided by Together AI, MyShell AI, and Chai AI.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Current inference support is primarily for single-GPU, batch-size-1 workloads. While efforts are underway to expand framework integration and performance, broader multi-GPU or distributed inference capabilities may be limited. Legacy training instructions are still provided, but the updated training workflow uses axolotl.