dynamo by ai-dynamo

Inference framework for distributed generative AI model serving

created 5 months ago · 4,593 stars · Top 10.9% on sourcepulse

Project Summary

NVIDIA Dynamo is a distributed inference serving framework designed for high-throughput, low-latency serving of generative AI and reasoning models across multiple nodes. It targets users needing to deploy LLMs at scale, offering features like disaggregated prefill/decode, dynamic GPU scheduling, and KV cache offloading to optimize performance and resource utilization.

How It Works

Dynamo employs a disaggregated architecture, separating prefill and decode stages to maximize GPU utilization and allow flexible throughput/latency trade-offs. It features LLM-aware request routing for efficient KV cache management and dynamic GPU scheduling to adapt to fluctuating workloads. Built with Rust for performance and Python for extensibility, it supports multiple inference backends (TRT-LLM, vLLM, SGLang) and utilizes NIXL for accelerated data transfer.
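
To make the disaggregated flow concrete, here is a toy, framework-independent sketch in Python (explicitly not Dynamo's API) of prefill and decode running as separate stages with a KV cache handed off between them:

    # Toy illustration only, not Dynamo's API. Disaggregated serving splits
    # generation into a compute-heavy prefill stage that builds the KV cache
    # and a latency-sensitive decode stage that consumes it, so each stage
    # can run on a GPU pool sized for its workload.
    from dataclasses import dataclass, field

    @dataclass
    class KVCache:
        tokens: list[str] = field(default_factory=list)  # stand-in for key/value tensors

    def prefill_worker(prompt: str) -> KVCache:
        # Processes the full prompt in one pass (throughput-bound).
        return KVCache(tokens=prompt.split())

    def decode_worker(cache: KVCache, max_new_tokens: int) -> list[str]:
        # Emits tokens one at a time, reusing the cache (latency-bound).
        return [f"<token:{len(cache.tokens) + i}>" for i in range(max_new_tokens)]

    def serve(prompt: str) -> list[str]:
        cache = prefill_worker(prompt)   # would run on the prefill pool
        return decode_worker(cache, 4)   # would run on the decode pool

    print(serve("Explain disaggregated serving"))

In the real system the handoff is an actual KV cache transfer between GPU pools (accelerated by NIXL), which is what allows the prefill/decode ratio to be tuned for throughput or latency.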

Quick Start & Requirements

  • Install: pip install ai-dynamo[all] (a quick sanity check follows this list)
  • Prerequisites: Ubuntu 24.04 (recommended), python3-dev, python3-pip, python3-venv, libucx0. CUDA and specific inference backends may require additional setup.
  • Resources: Kubernetes deployment requires building the base Docker image; local testing uses docker compose.
  • Docs: Roadmap, Support Matrix, Guides
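
As a minimal post-install sanity check (the distribution name comes from the pip command above; nothing else about Dynamo's module layout is assumed):

    # Confirm the package is installed and report its version.
    import importlib.metadata

    print("ai-dynamo version:", importlib.metadata.version("ai-dynamo"))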

Highlighted Details

  • Supports multiple inference backends (TRT-LLM, vLLM, SGLang).
  • Disaggregated prefill and decode for throughput/latency optimization.
  • Dynamic GPU scheduling and LLM-aware request routing.
  • KV cache offloading for enhanced system throughput.
  • OpenAI-compatible frontend.
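
Because the frontend is OpenAI-compatible, a stock OpenAI client should work against a running deployment. A minimal sketch, assuming a local frontend at port 8000 (the base URL, API key handling, and model name are illustrative, not values from the README):

    # Chat completion against Dynamo's OpenAI-compatible frontend.
    # base_url and model are assumptions; substitute your deployment's values.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # assumed frontend address
        api_key="unused",                     # local frontends may not check this
    )

    resp = client.chat.completions.create(
        model="your-served-model",  # whichever model your workers serve
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)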

Maintenance & Community

  • Open-source first development approach.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state a license; it describes the project as "fully open-source" and cites an "OSS (Open Source Software) first development approach."

Limitations & Caveats

The README recommends Ubuntu 24.04, so other operating systems may need additional setup. Kubernetes deployments require building the base Docker image. Backend-specific compatibility details are linked (Support Matrix) rather than stated directly and require further investigation.

Health Check

  • Last commit: 19 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 481
  • Issues (30d): 86
  • Star History: 722 stars in the last 90 days

Explore Similar Projects

lorax by predibase

Multi-LoRA inference server for serving 1000s of fine-tuned LLMs

  • Top 0.4% on sourcepulse · 3k stars · created 1 year ago · updated 2 months ago
  • Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 2 more.

gpustack by gpustack

GPU cluster manager for AI model deployment

  • Top 1.6% on sourcepulse · 3k stars · created 1 year ago · updated 2 days ago
  • Starred by Jeff Hammerbacher (cofounder of Cloudera), Stas Bekman (author of the Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 2 more.