dynamo by ai-dynamo

Inference framework for distributed generative AI model serving

Created 11 months ago
6,127 stars

Top 8.3% on SourcePulse

View on GitHub
Project Summary

NVIDIA Dynamo is a distributed inference serving framework designed for high-throughput, low-latency serving of generative AI and reasoning models across multiple nodes. It targets users needing to deploy LLMs at scale, offering features like disaggregated prefill/decode, dynamic GPU scheduling, and KV cache offloading to optimize performance and resource utilization.

How It Works

Dynamo employs a disaggregated architecture, separating prefill and decode stages to maximize GPU utilization and allow flexible throughput/latency trade-offs. It features LLM-aware request routing for efficient KV cache management and dynamic GPU scheduling to adapt to fluctuating workloads. Built with Rust for performance and Python for extensibility, it supports multiple inference backends (TRT-LLM, vLLM, SGLang) and utilizes NIXL for accelerated data transfer.
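The "LLM-aware request routing" above means steering each request toward the worker that already holds the largest cached prefix of that request's tokens, so prefill work is not repeated. A minimal sketch of the idea, assuming a fixed KV block size and a simple worker map (this is not Dynamo's actual implementation):

```python
# Toy KV-cache-aware router: pick the worker with the longest cached prefix.
# BLOCK_SIZE and the worker/cache representation are illustrative assumptions.

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative)

def blocks(tokens):
    """Split a token sequence into fixed-size prefix blocks."""
    return [tuple(tokens[i:i + BLOCK_SIZE]) for i in range(0, len(tokens), BLOCK_SIZE)]

def route(request_tokens, workers):
    """Return the worker id whose cached blocks cover the longest request prefix.

    `workers` maps worker id -> set of cached KV blocks.
    """
    def prefix_overlap(cached):
        n = 0
        for b in blocks(request_tokens):
            if b not in cached:
                break  # prefix match only: stop at the first cache miss
            n += 1
        return n
    return max(workers, key=lambda w: prefix_overlap(workers[w]))

# Worker "a" has the request's first two blocks cached; "b" has none.
workers = {
    "a": {(1, 2, 3, 4), (5, 6, 7, 8)},
    "b": set(),
}
print(route([1, 2, 3, 4, 5, 6, 7, 8, 9], workers))  # -> a
```

A production router also weighs current load and queue depth, not just cache overlap; the sketch isolates the cache-affinity idea only.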

Quick Start & Requirements

  • Install: pip install "ai-dynamo[all]" (quotes guard the brackets from shell globbing)
  • Prerequisites: Ubuntu 24.04 (recommended), python3-dev, python3-pip, python3-venv, libucx0. CUDA and specific inference backends may require additional setup.
  • Resources: Kubernetes deployment requires building the base Docker image; local testing uses Docker Compose.
  • Docs: Roadmap, Support Matrix, Guides

Highlighted Details

  • Supports multiple inference backends (TRT-LLM, vLLM, SGLang).
  • Disaggregated prefill and decode for throughput/latency optimization.
  • Dynamic GPU scheduling and LLM-aware request routing.
  • KV cache offloading for enhanced system throughput.
  • OpenAI-compatible frontend.
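Because the frontend is OpenAI-compatible, clients send standard chat-completions payloads. A sketch of building such a request with only the standard library; the host, port, and model id are assumptions, and the actual HTTP call is left commented out since it requires a running frontend:

```python
# Build a standard OpenAI-style chat-completions payload.
# Model id and endpoint address below are assumptions, not Dynamo defaults.
import json

payload = {
    "model": "meta-llama/Llama-3-8b-instruct",  # assumed model id
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
body = json.dumps(payload)

# To send it once a frontend is running locally (assumed address):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(), headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
print(body)
```

Any existing OpenAI SDK client can also be pointed at the frontend by overriding its base URL, which is the usual benefit of an OpenAI-compatible server.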

Maintenance & Community

  • Open-source first development approach.
  • Community support channels are not explicitly mentioned in the README.

Licensing & Compatibility

  • The README does not explicitly state the license. It mentions "fully open-source" and "OSS (Open Source Software) first development approach."

Limitations & Caveats

The README recommends Ubuntu 24.04, suggesting potential compatibility issues on other operating systems. Building custom Docker images is necessary for Kubernetes deployments. Backend-specific compatibility details are only linked, not stated inline, so consult the support matrix before committing to a deployment.

Health Check

  • Last Commit: 16 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 879
  • Issues (30d): 178
  • Star History: 165 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Chaoyu Yang (founder of Bento), and 3 more.

llm-d by llm-d

Kubernetes-native framework for distributed LLM inference. 3k stars (top 0.7%). Created 10 months ago; updated 1 day ago.