DeepSpeed-MII by deepspeedai

Python library for high-throughput, low-latency, and cost-effective model inference

created 3 years ago
2,042 stars

Top 22.2% on sourcepulse

Project Summary

DeepSpeed-MII is a Python library designed to enable high-throughput, low-latency, and cost-effective inference for large language models and text-to-image models. It targets researchers and developers who need to deploy models efficiently, offering measurable throughput and latency gains over prior serving systems such as vLLM.

How It Works

MII builds on DeepSpeed-Inference and incorporates key technologies such as blocked KV-caching, continuous batching, Dynamic SplitFuse, and tensor parallelism. MII applies these optimizations automatically based on model architecture, model size, batch size, and available hardware, minimizing latency while maximizing throughput.
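To make the blocked KV-cache idea concrete, here is a toy allocator in plain Python. This is a minimal sketch with illustrative names and block size, not MII's actual implementation: the cache is divided into fixed-size blocks that sequences claim on demand, so memory tracks the tokens actually generated rather than a padded preallocated maximum.

```python
# Toy sketch of blocked KV-cache allocation (illustrative only, not MII's code).
BLOCK_SIZE = 16  # tokens per cache block (illustrative choice)

class BlockedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of block ids
        self.seq_lens = {}      # sequence id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token; grab a block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block is full (or this is token 0)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = BlockedKVCache(num_blocks=8)
for _ in range(20):              # a 20-token sequence needs ceil(20/16) = 2 blocks
    cache.append_token("seq-a")
print(len(cache.block_tables["seq-a"]))  # → 2 (blocks used, not a padded max)
cache.release("seq-a")
print(len(cache.free_blocks))            # → 8 (all blocks reusable again)
```

Because memory is allocated in small blocks rather than per-sequence maximums, the scheduler can pack many more in-flight sequences into the same GPU memory, which is what enables continuous batching at high occupancy.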

Quick Start & Requirements

  • Install via pip: pip install deepspeed-mii
  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+.
  • Pre-compiled wheels are provided via the deepspeed-kernels library.
  • See Getting Started with MII for examples.
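For the non-persistent path, the Getting Started examples boil down to a few lines. The model id below is illustrative (any supported Hugging Face model id works), and running it requires a supported NVIDIA GPU and downloaded weights:

```python
# Non-persistent (script-based) inference sketch; requires an Ampere+ GPU.
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")  # model id is illustrative
response = pipe(["DeepSpeed is"], max_new_tokens=128)
print(response)
```

The pipeline object lives only for the duration of the script, which suits batch jobs and experimentation.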

Highlighted Details

  • Supports over 37,000 models across 11 popular architectures (e.g., Llama, Mistral, Mixtral, Falcon, Qwen).
  • Achieves up to 2.5x higher effective throughput compared to vLLM.
  • Offers both non-persistent (script-based) and persistent (gRPC server) deployment options.
  • Includes support for RESTful API endpoints for inference.
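A rough sketch of the persistent mode (the method names follow MII's documented client API, but treat the details, including the RESTful flags, as a sketch to verify against the docs):

```python
# Persistent (gRPC server) deployment sketch; model id is illustrative.
import mii

# Start a long-lived deployment; pass enable_restful_api=True / restful_api_port
# here if a RESTful endpoint is also wanted (flag names per MII's docs).
client = mii.serve("mistralai/Mistral-7B-v0.1")

# Any process on the host can then connect and generate.
response = client.generate(["Seattle is"], max_new_tokens=64)
print(response)

client.terminate_server()  # shut the deployment down when done
```

The persistent server decouples model loading from request handling, so multiple clients can share one warm deployment.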

Maintenance & Community

  • Developed by DeepSpeed (Microsoft).
  • Contributions welcome under the Developer Certificate of Origin (DCO).
  • Follows the Microsoft Open Source Code of Conduct.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Although MII supports a vast number of models, support for a specific architecture should be verified before deployment.
  • Performance claims are benchmark-dependent and may vary with specific hardware and model configurations.
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 44 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM

1.0%
402
Lightweight training framework for model pre-training
created 1 year ago
updated 1 week ago
Starred by George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), Anton Bukov (Cofounder of 1inch Network), and 16 more.

tinygrad by tinygrad

0.1%
30k
Minimalist deep learning framework for education and exploration
created 4 years ago
updated 14 hours ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

0.2%
40k
Deep learning optimization library for distributed training and inference
created 5 years ago
updated 20 hours ago