Medusa by FasterDecoding

Framework for accelerating LLM generation using multiple decoding heads

Created 2 years ago
2,655 stars

Top 17.8% on SourcePulse

View on GitHub
Project Summary

Medusa is a framework that accelerates Large Language Model (LLM) generation by adding multiple decoding heads to an existing model. It targets researchers and developers who want faster inference without the separate draft model that traditional speculative decoding requires, offering a parameter-efficient speedup that is especially effective in the batch-size-1 inference scenarios common to local hosting.

How It Works

Medusa augments an existing LLM with lightweight "heads" that predict multiple future tokens concurrently. Unlike speculative decoding, it requires no separate draft model: the new heads are fine-tuned on top of the same base model, keeping the approach parameter-efficient. During generation, the heads propose several candidate continuations, which are verified in a single forward pass via a tree-based attention mechanism. An acceptance scheme then keeps the longest valid prefix of the candidates, so each decoding step can advance by multiple tokens. The Medusa-2 variant trains the full model jointly with the heads for larger speedups, and self-distillation lets Medusa be applied to any fine-tuned LLM without access to the original training data.
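To make the mechanism concrete, here is a minimal PyTorch sketch of the idea, not the repo's actual code: ResBlock, MedusaHeads, and accept_prefix are illustrative names, and the real implementation verifies a whole tree of candidates in one pass via tree attention rather than a single sequence.

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        """One residual feed-forward block; the Medusa paper attaches a
        single block like this per head (illustrative re-implementation)."""
        def __init__(self, hidden_size: int):
            super().__init__()
            self.linear = nn.Linear(hidden_size, hidden_size)
            self.act = nn.SiLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.act(self.linear(x))

    class MedusaHeads(nn.Module):
        """K lightweight heads on the base model's last hidden state.
        Head k predicts the token k+1 positions beyond the next token."""
        def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Sequential(
                    ResBlock(hidden_size),
                    nn.Linear(hidden_size, vocab_size, bias=False),
                )
                for _ in range(num_heads)
            )

        def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
            # last_hidden: (batch, hidden_size) at the current position;
            # returns one (batch, vocab_size) logit tensor per head.
            return [head(last_hidden) for head in self.heads]

    def accept_prefix(speculated: list[int], base_tokens: list[int]) -> list[int]:
        """Greedy acceptance, simplified to one candidate sequence: keep
        speculated tokens while they match the base model's own predictions,
        and fall back to the base model's token at the first mismatch, so
        every step emits at least one token."""
        accepted: list[int] = []
        for spec, base in zip(speculated, base_tokens):
            accepted.append(base)  # the base model's token is always safe
            if spec != base:
                break              # stop extending past the first mismatch
        return accepted

With four heads, one forward pass can yield up to five tokens (four accepted speculations plus the base model's own next token), which is where the reported speedups come from.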

Quick Start & Requirements

  • Install: pip install medusa-llm, or from source (git clone https://github.com/FasterDecoding/Medusa.git, cd Medusa, pip install -e .). A usage sketch follows this list.
  • Prerequisites: a CUDA-enabled GPU is required for inference and training. Training requires accelerate and axolotl.
  • Resources: Model weights for various sizes (7B, 13B, 33B) are available on Hugging Face. Inference supports 8-bit and 4-bit quantization.
  • Docs: Blog, Report, Roadmap.
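For orientation, a hedged Python sketch of programmatic inference modeled on the project's README; the MedusaModel import path and the get_tokenizer and medusa_generate calls are assumptions to verify against the current repo.

    import torch
    from medusa.model.medusa_model import MedusaModel  # import path per the README; verify

    # Load a released 7B checkpoint from Hugging Face (CUDA GPU required).
    model = MedusaModel.from_pretrained(
        "FasterDecoding/medusa-vicuna-7b-v1.3",
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    tokenizer = model.get_tokenizer()  # helper assumed from the README example

    prompt = "Explain speculative decoding in one paragraph."
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

    # medusa_generate streams partial outputs as candidates are accepted
    # (method name and arguments assumed; check the repo's inference code).
    for step in model.medusa_generate(input_ids, temperature=0.7, max_steps=512):
        print(step["text"], end="", flush=True)

The README also documents a CLI entry point (python -m medusa.inference.cli --model <checkpoint>) for interactive chat, with options for 8-bit and 4-bit loading; exact flag names should be confirmed in the repo.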

Highlighted Details

  • Achieves 2x speedup for batch size 1 inference on Vicuna models.
  • Medusa-2 offers 2.2-3.6x speedup via full-model training.
  • Supports self-distillation to apply Medusa to any fine-tuned LLM.
  • Integrated with TensorRT-LLM, TGI, and RTP-LLM.

Maintenance & Community

The project has a public roadmap, and community contributions are welcomed via GitHub issues and pull requests. Development is supported by Together AI, MyShell AI, and Chai AI, though recent commit activity has slowed (see Health Check below).

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Current inference support is primarily single-GPU, batch size 1; broader multi-GPU or distributed inference capabilities may be limited, although efforts to expand framework integration and performance are underway. Legacy training instructions remain available, but the updated training pipeline uses axolotl.

Health Check
  • Last Commit: 1 year ago
  • Responsiveness: inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 23 stars in the last 30 days

Explore Similar Projects

Starred by Casper Hansen (Author of AutoAWQ), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 5 more.

xtuner by InternLM

  • 5k stars (top 0.2% on SourcePulse)
  • LLM fine-tuning toolkit for research
  • Created 2 years ago; updated 11 hours ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 5 more.

streaming-llm by mit-han-lab

  • 7k stars (top 0.1% on SourcePulse)
  • Framework for efficient LLM streaming
  • Created 2 years ago; updated 1 year ago
Starred by Tobi Lutke (Cofounder of Shopify), Eric Zhu (Coauthor of AutoGen; Research Scientist at Microsoft Research), and 41 more.

guidance by guidance-ai

  • 21k stars (top 0.1% on SourcePulse)
  • A programming paradigm for steering LLMs
  • Created 3 years ago; updated 2 weeks ago