DeepSpeed-MII by deepspeedai

Python library for high-throughput, low-latency, and cost-effective model inference

Created 3 years ago
2,055 stars

Top 21.6% on SourcePulse

Project Summary

DeepSpeed-MII is a Python library designed to enable high-throughput, low-latency, and cost-effective inference for large language models and text-to-image models. It targets researchers and developers who need to deploy models efficiently, offering measurable performance gains over existing serving solutions (up to 2.5x higher effective throughput than vLLM on reported benchmarks).

How It Works

MII builds on DeepSpeed-Inference and incorporates key technologies such as blocked KV-caching, continuous batching, Dynamic SplitFuse, and tensor parallelism. MII applies these optimizations automatically, based on model architecture, model size, batch size, and available hardware, to minimize latency and maximize throughput.

Quick Start & Requirements

  • Install via pip: pip install deepspeed-mii
  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+.
  • Pre-compiled kernels are provided via the deepspeed-kernels library.
  • See Getting Started with MII for examples; a minimal script-based sketch follows this list.
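
As a quick smoke test, the non-persistent pipeline runs inference directly in a script, with MII selecting kernels and batching behavior automatically. The sketch below follows the usage pattern shown in the MII documentation; the model checkpoint and generation parameters are illustrative choices, not requirements.

    # Non-persistent (script-based) inference sketch, following the pattern
    # in the MII docs. The checkpoint and max_new_tokens value are
    # illustrative; any supported Hugging Face model should work.
    import mii

    # Load the model into an in-process pipeline; MII applies its
    # optimizations (blocked KV-caching, continuous batching) automatically.
    pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")

    # Batched generation over multiple prompts.
    response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)

    # For tensor parallelism across GPUs, launch the same script with the
    # DeepSpeed launcher, e.g.: deepspeed --num_gpus 2 this_script.py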

Highlighted Details

  • Supports over 37,000 models across 11 popular architectures (e.g., Llama, Mistral, Mixtral, Falcon, Qwen).
  • Achieves up to 2.5x higher effective throughput compared to vLLM.
  • Offers both non-persistent (script-based) and persistent (gRPC server) deployment options.
  • Includes support for RESTful API endpoints for inference; a persistent-deployment sketch showing both the gRPC and REST options follows this list.
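
A persistent deployment starts a long-lived gRPC server that any process can query, and the same server can expose a RESTful HTTP endpoint. The sketch below assembles these options following the patterns in the MII docs; the checkpoint, tensor_parallel degree, and REST port are illustrative assumptions.

    # Persistent (gRPC server) deployment sketch, following the MII docs.
    # The checkpoint, tensor_parallel degree, and restful_api_port below are
    # illustrative assumptions, not required values.
    import mii

    # Stand up a persistent model server. tensor_parallel shards the model
    # across GPUs; enable_restful_api additionally exposes an HTTP endpoint.
    client = mii.serve(
        "mistralai/Mistral-7B-v0.1",
        tensor_parallel=2,
        enable_restful_api=True,
        restful_api_port=28080,
    )

    # Query the server over gRPC (works from this or any other process).
    response = client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
    print(response)

    # Shut the server down when finished.
    client.terminate_server()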

Maintenance & Community

  • Developed by DeepSpeed (Microsoft).
  • Contributions welcome under the Developer Certificate of Origin (DCO).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Although MII advertises support for a vast number of models, support for any specific architecture should be verified before deployment.
  • Performance claims are benchmark-dependent and may vary with specific hardware and model configurations.
Health Check

  • Last Commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

0.4%
455
CLI tool for LLM latency/memory analysis during training/inference
Created 2 years ago
Updated 5 months ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0%
790
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 11 more.

ctransformers by marella

0.1%
2k
Python bindings for fast Transformer model inference
Created 2 years ago
Updated 1 year ago
Starred by Jeff Hammerbacher (Cofounder of Cloudera), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 20 more.

TensorRT-LLM by NVIDIA

0.5%
12k
LLM inference optimization SDK for NVIDIA GPUs
Created 2 years ago
Updated 12 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla and OpenAI; Author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

vllm by vllm-project

1.1%
58k
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago
Updated 12 hours ago