Medusa by FasterDecoding

Framework for accelerating LLM generation using multiple decoding heads

created 1 year ago
2,583 stars

Top 18.6% on sourcepulse

View on GitHub
Project Summary

Medusa is a framework designed to accelerate Large Language Model (LLM) generation by employing multiple decoding heads. It targets researchers and developers seeking to improve inference speed without the complexity of traditional speculative decoding, offering a parameter-efficient way to enhance LLM performance, particularly for single-batch inference scenarios common in local hosting.

How It Works

Medusa augments an existing LLM with lightweight extra "heads" that each predict a token several positions ahead, so multiple future tokens are proposed concurrently. Unlike speculative decoding, it doesn't require a separate draft model: the heads are fine-tuned on top of the same base model, keeping the approach parameter-efficient. During generation, the heads propose multiple candidate continuations, which are verified in a single forward pass via a tree-based attention mechanism; an acceptance scheme then keeps the longest valid prefix among the candidates, so several tokens can be committed per step. The Medusa-2 variant additionally trains the full model for larger speedups, and self-distillation lets Medusa be applied to any fine-tuned LLM without access to its original training data.
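To make the acceptance scheme concrete, here is a minimal sketch of one greedy decoding step, simplified to a single candidate sequence rather than the candidate tree the real implementation verifies with tree attention. All names here (`medusa_step`, `verify_logits`) are illustrative stand-ins, not the project's API.

```python
import torch

def medusa_step(tokens, head_proposals, verify_logits):
    """Accept the longest prefix of `head_proposals` that the base model
    would itself generate greedily after `tokens`, then append the base
    model's free "bonus" token from the same forward pass."""
    candidate = tokens + head_proposals
    logits = verify_logits(candidate)        # one pass scores every position
    greedy = logits.argmax(dim=-1).tolist()  # base model's greedy pick per position

    accepted = []
    for k, proposed in enumerate(head_proposals):
        # Logits at position i predict the token at position i + 1, so the
        # check for head_proposals[k] lives at index len(tokens) - 1 + k.
        if greedy[len(tokens) - 1 + k] == proposed:
            accepted.append(proposed)
        else:
            break
    # The base model's own next-token prediction after the accepted prefix
    # comes for free from the verification pass.
    bonus = greedy[len(tokens) - 1 + len(accepted)]
    return tokens + accepted + [bonus]

# Toy usage with a stand-in "model": a fixed random table over a 10-token
# vocabulary, in place of a real transformer forward pass.
torch.manual_seed(0)
table = torch.randn(10, 10)
def toy_verify_logits(token_ids):
    return table[torch.tensor(token_ids)]

print(medusa_step([1, 2, 3], [4, 5, 6], toy_verify_logits))
```

In the real system the heads' top choices are expanded into a tree of candidates so several continuations are verified at once; this sketch keeps only the single-path case to show why a hit on the guesses commits multiple tokens per forward pass.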

Quick Start & Requirements

  • Install: pip install medusa-llm or from source (git clone https://github.com/FasterDecoding/Medusa.git, cd Medusa, pip install -e .).
  • Prerequisites: CUDA-enabled GPU is required for inference and training. Training requires accelerate and axolotl.
  • Resources: Model weights for various sizes (7B, 13B, 33B) are available on Hugging Face. Inference supports 8-bit and 4-bit quantization; a hedged load-and-generate sketch follows this list.
  • Docs: Blog, Report, Roadmap.
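For programmatic use, the sketch below loads a Medusa checkpoint and generates from it. The import path, the `MedusaModel` class, the `get_tokenizer()` and `medusa_generate()` calls, and the checkpoint name are assumptions drawn from memory of the upstream examples; verify them against the current README, as the API may have changed.

```python
# Hedged usage sketch -- names below are assumptions, not a confirmed API.
import torch
from medusa.model.medusa_model import MedusaModel  # assumed import path

model = MedusaModel.from_pretrained(
    "FasterDecoding/medusa-vicuna-7b-v1.3",  # assumed 7B checkpoint name on Hugging Face
    torch_dtype=torch.float16,
    device_map="auto",                       # expects a CUDA-enabled GPU
)
tokenizer = model.get_tokenizer()            # assumed convenience helper

prompt = "Explain speculative decoding in one paragraph."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# medusa_generate is assumed to stream partial outputs as the heads'
# candidates are accepted; adapt this loop if the actual API differs.
output_text = ""
for step in model.medusa_generate(input_ids, temperature=0.7, max_steps=512):
    output_text = step["text"]               # assumed: full text generated so far
print(output_text)
```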

Highlighted Details

  • Achieves 2x speedup for batch size 1 inference on Vicuna models.
  • Medusa-2 offers 2.2-3.6x speedup via full-model training.
  • Supports self-distillation to apply Medusa to any fine-tuned LLM.
  • Integrated with TensorRT-LLM, TGI, and RTP-LLM.

Maintenance & Community

A public roadmap is available, though commit activity has slowed (the last commit was about a year ago). Community contributions are welcomed via GitHub issues and pull requests. The project is supported by Together AI, MyShell AI, and Chai AI.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Inference support currently targets a single GPU at batch size 1; multi-GPU and distributed inference are limited, though broader framework integration and performance work are on the roadmap. Legacy training instructions are still included, but the recommended training path now goes through axolotl.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History

  • 79 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

Tiny pretraining project for a 1.1B Llama model
Top 0.3% on sourcepulse · 9k stars
created 1 year ago · updated 1 year ago