Inference framework for distributed generative AI model serving
NVIDIA Dynamo is a distributed framework for high-throughput, low-latency serving of generative AI and reasoning models across multiple nodes. It targets teams deploying LLMs at scale, offering disaggregated prefill/decode, dynamic GPU scheduling, and KV cache offloading to optimize performance and resource utilization.
How It Works
Dynamo employs a disaggregated architecture, separating prefill and decode stages to maximize GPU utilization and allow flexible throughput/latency trade-offs. It features LLM-aware request routing for efficient KV cache management and dynamic GPU scheduling to adapt to fluctuating workloads. Built with Rust for performance and Python for extensibility, it supports multiple inference backends (TRT-LLM, vLLM, SGLang) and utilizes NIXL for accelerated data transfer.
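To make the disaggregation concrete, here is a minimal Python sketch of the prefill/decode split. The names (`PrefillWorker`, `DecodeWorker`, `serve`) are illustrative assumptions, not Dynamo's actual API, and the model math is stubbed out; the point is that the two stages have different resource profiles and communicate only through a transferred KV cache.

```python
# Conceptual sketch only: PrefillWorker/DecodeWorker/serve are illustrative
# names, not Dynamo's actual API.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request key/value cache: produced by prefill, consumed by decode."""
    tokens: list[int] = field(default_factory=list)


class PrefillWorker:
    """Compute-bound stage: processes the whole prompt once, emits a KV cache."""

    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # A real worker would run the model's forward pass over the prompt.
        return KVCache(tokens=list(prompt_tokens))


class DecodeWorker:
    """Memory-bandwidth-bound stage: generates tokens from the received cache."""

    def decode(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        out: list[int] = []
        for _ in range(max_new_tokens):
            # Stand-in for one autoregressive step reusing the KV cache.
            next_token = cache.tokens[-1] + 1 if cache.tokens else 0
            cache.tokens.append(next_token)
            out.append(next_token)
        return out


def serve(prompt_tokens: list[int]) -> list[int]:
    # The router hands the prompt to a prefill worker, then ships the KV
    # cache (via NIXL in Dynamo) to a decode worker, possibly on other GPUs,
    # so each stage can be scaled and scheduled independently.
    cache = PrefillWorker().prefill(prompt_tokens)
    return DecodeWorker().decode(cache, max_new_tokens=4)


if __name__ == "__main__":
    print(serve([101, 102, 103]))  # -> [104, 105, 106, 107]
```

Because the stages are independent processes, an operator can add decode capacity for chatty, long-generation workloads or prefill capacity for long-prompt workloads without rescaling the other stage.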
Quick Start & Requirements
Install the Python package:

pip install ai-dynamo[all]

System prerequisites include python3-dev, python3-pip, python3-venv, and libucx0. CUDA and specific inference backends may require additional configuration; a docker compose setup is also available.
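Once a deployment is running, clients typically talk to it over an OpenAI-compatible HTTP API. Below is a minimal client sketch, assuming the frontend listens on localhost:8000 and serves a model named Qwen/Qwen2.5-7B-Instruct; both are deployment-specific assumptions, so adjust them to your setup.

```python
# Minimal client sketch. The endpoint URL and model name are assumptions;
# substitute the values from your own deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",  # hypothetical model name
        "messages": [{"role": "user", "content": "Explain KV cache offloading."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```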
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README recommends Ubuntu 24.04, suggesting potential compatibility issues on other operating systems. Building custom Docker images is necessary for Kubernetes deployments. Specific backend compatibility details are linked but require further investigation.