Medusa by FasterDecoding

Framework for accelerating LLM generation using multiple decoding heads

created 1 year ago
2,583 stars

Top 18.6% on sourcepulse

View on GitHub
Project Summary

Medusa is a framework designed to accelerate Large Language Model (LLM) generation by employing multiple decoding heads. It targets researchers and developers seeking to improve inference speed without the complexity of traditional speculative decoding, offering a parameter-efficient way to enhance LLM performance, particularly for single-batch inference scenarios common in local hosting.

How It Works

Medusa augments an existing LLM with lightweight extra "heads" that each predict a token several positions ahead, so multiple future tokens are proposed concurrently. Unlike speculative decoding, it doesn't require a separate draft model: the heads are fine-tuned on top of the same base model, keeping the approach parameter-efficient. During generation, the heads propose multiple candidate continuations, which are verified in a single forward pass via a tree-based attention mechanism; an acceptance scheme then keeps the longest valid prefix among the candidates, so several tokens can be committed per step. The Medusa-2 variant additionally trains the full model for larger speedups, and self-distillation lets Medusa be applied to any fine-tuned LLM without access to its original training data.
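To make the acceptance scheme concrete, here is a minimal sketch of one greedy decoding step, simplified to a single candidate sequence rather than the candidate tree the real implementation verifies with tree attention. All names here (`medusa_step`, `verify_logits`) are illustrative stand-ins, not the project's API.

```python
import torch

def medusa_step(tokens, head_proposals, verify_logits):
    """Accept the longest prefix of `head_proposals` that the base model
    would itself generate greedily after `tokens`, then append the base
    model's free "bonus" token from the same forward pass."""
    candidate = tokens + head_proposals
    logits = verify_logits(candidate)        # one pass scores every position
    greedy = logits.argmax(dim=-1).tolist()  # base model's greedy pick per position

    accepted = []
    for k, proposed in enumerate(head_proposals):
        # Logits at position i predict the token at position i + 1, so the
        # check for head_proposals[k] lives at index len(tokens) - 1 + k.
        if greedy[len(tokens) - 1 + k] == proposed:
            accepted.append(proposed)
        else:
            break
    # The base model's own next-token prediction after the accepted prefix
    # comes for free from the verification pass.
    bonus = greedy[len(tokens) - 1 + len(accepted)]
    return tokens + accepted + [bonus]

# Toy usage with a stand-in "model": a fixed random table over a 10-token
# vocabulary, in place of a real transformer forward pass.
torch.manual_seed(0)
table = torch.randn(10, 10)
def toy_verify_logits(token_ids):
    return table[torch.tensor(token_ids)]

print(medusa_step([1, 2, 3], [4, 5, 6], toy_verify_logits))
```

In the real system the heads' top choices are expanded into a tree of candidates so several continuations are verified at once; this sketch keeps only the single-path case to show why a hit on the guesses commits multiple tokens per forward pass.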

Quick Start & Requirements

  • Install: pip install medusa-llm or from source (git clone https://github.com/FasterDecoding/Medusa.git, cd Medusa, pip install -e .).
  • Prerequisites: CUDA-enabled GPU is required for inference and training. Training requires accelerate and axolotl.
  • Resources: Model weights for various sizes (7B, 13B, 33B) are available on Hugging Face. Inference supports 8-bit and 4-bit quantization; a hedged load-and-generate sketch follows this list.
  • Docs: Blog, Report, Roadmap.
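For programmatic use, the sketch below loads a Medusa checkpoint and generates from it. The import path, the `MedusaModel` class, the `get_tokenizer()` and `medusa_generate()` calls, and the checkpoint name are assumptions drawn from memory of the upstream examples; verify them against the current README, as the API may have changed.

```python
# Hedged usage sketch -- names below are assumptions, not a confirmed API.
import torch
from medusa.model.medusa_model import MedusaModel  # assumed import path

model = MedusaModel.from_pretrained(
    "FasterDecoding/medusa-vicuna-7b-v1.3",  # assumed 7B checkpoint name on Hugging Face
    torch_dtype=torch.float16,
    device_map="auto",                       # expects a CUDA-enabled GPU
)
tokenizer = model.get_tokenizer()            # assumed convenience helper

prompt = "Explain speculative decoding in one paragraph."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# medusa_generate is assumed to stream partial outputs as the heads'
# candidates are accepted; adapt this loop if the actual API differs.
output_text = ""
for step in model.medusa_generate(input_ids, temperature=0.7, max_steps=512):
    output_text = step["text"]               # assumed: full text generated so far
print(output_text)
```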

Highlighted Details

  • Achieves 2x speedup for batch size 1 inference on Vicuna models.
  • Medusa-2 offers 2.2-3.6x speedup via full-model training.
  • Supports self-distillation to apply Medusa to any fine-tuned LLM.
  • Integrated with TensorRT-LLM, TGI, and RTP-LLM.

Maintenance & Community

A public roadmap is available, though commit activity has slowed (the last commit was about a year ago). Community contributions are welcomed via GitHub issues and pull requests. The project is supported by Together AI, MyShell AI, and Chai AI.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Inference support currently targets a single GPU at batch size 1; multi-GPU and distributed inference are limited, though broader framework integration and performance work are on the roadmap. Legacy training instructions are still included, but the recommended training path now goes through axolotl.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
Star History

  • 79 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.

TinyLlama by jzhang38

Tiny pretraining project for a 1.1B Llama model
Top 0.3% on sourcepulse · 9k stars
created 1 year ago · updated 1 year ago