DeepSpeed-MII by deepspeedai

Python library for high-throughput, low-latency, and cost-effective model inference

Created 3 years ago
2,055 stars

Top 21.6% on SourcePulse

Project Summary

DeepSpeed-MII is a Python library designed to enable high-throughput, low-latency, and cost-effective inference for large language models and text-to-image models. It targets researchers and developers who need to deploy models efficiently, offering measurable performance gains over existing serving solutions (up to 2.5x higher effective throughput than vLLM on reported benchmarks).

How It Works

MII builds on DeepSpeed-Inference and incorporates key technologies such as blocked KV-caching, continuous batching, Dynamic SplitFuse, and tensor parallelism. MII applies these optimizations automatically, based on model architecture, model size, batch size, and available hardware, to minimize latency and maximize throughput.

Quick Start & Requirements

  • Install via pip: pip install deepspeed-mii
  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+.
  • Pre-compiled kernels are provided via the deepspeed-kernels library.
  • See Getting Started with MII for examples; a minimal script-based sketch follows this list.
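
As a quick smoke test, the non-persistent pipeline runs inference directly in a script, with MII selecting kernels and batching behavior automatically. The sketch below follows the usage pattern shown in the MII documentation; the model checkpoint and generation parameters are illustrative choices, not requirements.

    # Non-persistent (script-based) inference sketch, following the pattern
    # in the MII docs. The checkpoint and max_new_tokens value are
    # illustrative; any supported Hugging Face model should work.
    import mii

    # Load the model into an in-process pipeline; MII applies its
    # optimizations (blocked KV-caching, continuous batching) automatically.
    pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")

    # Batched generation over multiple prompts.
    response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)

    # For tensor parallelism across GPUs, launch the same script with the
    # DeepSpeed launcher, e.g.: deepspeed --num_gpus 2 this_script.py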

Highlighted Details

  • Supports over 37,000 models across 11 popular architectures (e.g., Llama, Mistral, Mixtral, Falcon, Qwen).
  • Achieves up to 2.5x higher effective throughput compared to vLLM.
  • Offers both non-persistent (script-based) and persistent (gRPC server) deployment options.
  • Includes support for RESTful API endpoints for inference; a persistent-deployment sketch showing both the gRPC and REST options follows this list.
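
A persistent deployment starts a long-lived gRPC server that any process can query, and the same server can expose a RESTful HTTP endpoint. The sketch below assembles these options following the patterns in the MII docs; the checkpoint, tensor_parallel degree, and REST port are illustrative assumptions.

    # Persistent (gRPC server) deployment sketch, following the MII docs.
    # The checkpoint, tensor_parallel degree, and restful_api_port below are
    # illustrative assumptions, not required values.
    import mii

    # Stand up a persistent model server. tensor_parallel shards the model
    # across GPUs; enable_restful_api additionally exposes an HTTP endpoint.
    client = mii.serve(
        "mistralai/Mistral-7B-v0.1",
        tensor_parallel=2,
        enable_restful_api=True,
        restful_api_port=28080,
    )

    # Query the server over gRPC (works from this or any other process).
    response = client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)

    # Shut the server down when finished.
    client.terminate_server()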

Maintenance & Community

  • Developed by DeepSpeed (Microsoft).
  • Contributions welcome under the Developer Certificate of Origin (DCO).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Although MII advertises support for a vast number of models, support for any specific architecture should be verified before deployment.
  • Performance claims are benchmark-dependent and may vary with specific hardware and model configurations.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

0.4%
455
CLI tool for LLM latency/memory analysis during training/inference
Created 2 years ago
Updated 5 months ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0%
790
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

ctransformers by marella

0.1%
2k
Python bindings for fast Transformer model inference
Created 2 years ago
Updated 1 year ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

0.5%
12k
LLM inference optimization SDK for NVIDIA GPUs
Created 2 years ago
Updated 12 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 12 hours ago