DeepSpeed-MII by deepspeedai

Python library for high-throughput, low-latency, and cost-effective model inference

Created 3 years ago
2,073 stars

Top 21.4% on SourcePulse

View on GitHub
Project Summary

DeepSpeed-MII is a Python library designed to enable high-throughput, low-latency, and cost-effective inference for large language models and text-to-image models. It targets researchers and developers who need to deploy models efficiently, and reports substantial throughput and latency gains over existing serving systems such as vLLM.

How It Works

MII builds on DeepSpeed-Inference and incorporates key technologies such as blocked KV caching, continuous batching, Dynamic SplitFuse, and tensor parallelism. Given a model's architecture and size, the batch size, and the available hardware, MII automatically applies the optimizations that minimize latency and maximize throughput.
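
Most of these optimizations are applied automatically, but tensor parallelism is exposed as an explicit deployment option. A minimal sketch of sharding a persistent deployment across two GPUs, assuming the tensor_parallel argument shown in the MII README; the model name is illustrative, not prescriptive:

    # Sketch: persistent deployment sharded across 2 GPUs via tensor
    # parallelism. Assumes the tensor_parallel argument from the MII
    # README; the model name is illustrative.
    import mii

    client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=2)
    response = client.generate(["DeepSpeed is"], max_new_tokens=128)
    print(response)
    client.terminate_server()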

Quick Start & Requirements

  • Install via pip: pip install deepspeed-mii
  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+.
  • Pre-compiled wheels are provided via the deepspeed-kernels library.
  • See Getting Started with MII for examples; a minimal non-persistent usage sketch follows below.
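
For the non-persistent mode, a minimal usage sketch based on the mii.pipeline API shown in the MII README; the model name and generation parameters are illustrative:

    # Non-persistent (script-based) inference: the model runs in this
    # process and is torn down when the script exits.
    import mii

    # Model name is illustrative; any supported Hugging Face model ID works.
    pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")

    # Generate completions for a batch of prompts.
    response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)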

Highlighted Details

  • Supports over 37,000 models across 11 popular architectures (e.g., Llama, Mistral, Mixtral, Falcon, Qwen).
  • Achieves up to 2.5x higher effective throughput compared to vLLM.
  • Offers both non-persistent (script-based) and persistent (gRPC server) deployment options; a sketch of the persistent mode follows this list.
  • Includes support for RESTful API endpoints for inference.
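
A minimal sketch of the persistent mode, based on the mii.serve and mii.client APIs shown in the MII README; the model name is illustrative, and the RESTful options should be checked against the current docs:

    # Persistent (gRPC server) deployment: the model stays resident and
    # serves requests from multiple clients.
    import mii

    client = mii.serve("mistralai/Mistral-7B-v0.1")
    response = client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)

    # Other processes can connect to the same deployment:
    #   client = mii.client("mistralai/Mistral-7B-v0.1")
    # A RESTful endpoint can reportedly be enabled at deployment time:
    #   mii.serve(..., enable_restful_api=True, restful_api_port=28080)

    # Shut the server down when finished.
    client.terminate_server()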

Maintenance & Community

  • Developed by DeepSpeed (Microsoft).
  • Contributions welcome under the Developer Certificate of Origin (DCO).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Although a vast number of models are supported, support for a specific architecture should be verified before deployment.
  • Performance claims are benchmark-dependent and may vary with specific hardware and model configurations.

Health Check

  • Last Commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

0% · 461 stars
CLI tool for LLM latency/memory analysis during training/inference
Created 2 years ago · Updated 6 months ago

Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0% · 790 stars
Toolkit for easy model parallelization
Created 4 years ago · Updated 2 years ago

Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

ctransformers by marella

0% · 2k stars
Python bindings for fast Transformer model inference
Created 2 years ago · Updated 1 year ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

0.4% · 12k stars
LLM inference optimization SDK for NVIDIA GPUs
Created 2 years ago · Updated 9 hours ago

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 60 more.

vllm by vllm-project

1.1% · 62k stars
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago · Updated 9 hours ago