Inference framework for distributed generative AI model serving
NVIDIA Dynamo is a distributed framework for high-throughput, low-latency serving of generative AI and reasoning models across multiple nodes. It targets teams deploying LLMs at scale, offering disaggregated prefill/decode, dynamic GPU scheduling, and KV cache offloading to optimize performance and resource utilization.
How It Works
Dynamo employs a disaggregated architecture, separating prefill and decode stages to maximize GPU utilization and allow flexible throughput/latency trade-offs. It features LLM-aware request routing for efficient KV cache management and dynamic GPU scheduling to adapt to fluctuating workloads. Built with Rust for performance and Python for extensibility, it supports multiple inference backends (TRT-LLM, vLLM, SGLang) and utilizes NIXL for accelerated data transfer.
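To make the disaggregation concrete, here is a minimal Python sketch of the prefill/decode split. The names (`PrefillWorker`, `DecodeWorker`, `serve`) are illustrative assumptions, not Dynamo's actual API, and the model math is stubbed out; the point is that the two stages have different resource profiles and communicate only through a transferred KV cache.

```python
# Conceptual sketch only: PrefillWorker/DecodeWorker/serve are illustrative
# names, not Dynamo's actual API.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request key/value cache: produced by prefill, consumed by decode."""
    tokens: list[int] = field(default_factory=list)


class PrefillWorker:
    """Compute-bound stage: processes the whole prompt once, emits a KV cache."""

    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # A real worker would run the model's forward pass over the prompt.
        return KVCache(tokens=list(prompt_tokens))


class DecodeWorker:
    """Memory-bandwidth-bound stage: generates tokens from the received cache."""

    def decode(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        out: list[int] = []
        for _ in range(max_new_tokens):
            # Stand-in for one autoregressive step reusing the KV cache.
            next_token = cache.tokens[-1] + 1 if cache.tokens else 0
            cache.tokens.append(next_token)
            out.append(next_token)
        return out


def serve(prompt_tokens: list[int]) -> list[int]:
    # The router hands the prompt to a prefill worker, then ships the KV
    # cache (via NIXL in Dynamo) to a decode worker, possibly on other GPUs,
    # so each stage can be scaled and scheduled independently.
    cache = PrefillWorker().prefill(prompt_tokens)
    return DecodeWorker().decode(cache, max_new_tokens=4)


if __name__ == "__main__":
    print(serve([101, 102, 103]))  # -> [104, 105, 106, 107]
```

Because the stages are independent processes, an operator can add decode capacity for chatty, long-generation workloads or prefill capacity for long-prompt workloads without rescaling the other stage.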
Quick Start & Requirements
Install the Python package:

pip install ai-dynamo[all]

System prerequisites include python3-dev, python3-pip, python3-venv, and libucx0. CUDA and specific inference backends may require additional configuration; a docker compose setup is also available.
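Once a deployment is running, clients typically talk to it over an OpenAI-compatible HTTP API. Below is a minimal client sketch, assuming the frontend listens on localhost:8000 and serves a model named Qwen/Qwen2.5-7B-Instruct; both are deployment-specific assumptions, so adjust them to your setup.

```python
# Minimal client sketch. The endpoint URL and model name are assumptions;
# substitute the values from your own deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",  # hypothetical model name
        "messages": [{"role": "user", "content": "Explain KV cache offloading."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```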
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README recommends Ubuntu 24.04, suggesting potential compatibility issues on other operating systems. Building custom Docker images is necessary for Kubernetes deployments. Specific backend compatibility details are linked but require further investigation.