server by triton-inference-server

AI model inference serving optimized for cloud and edge

Created 7 years ago
9,818 stars

Top 5.1% on SourcePulse

Project Summary

Triton Inference Server addresses the challenge of efficiently deploying diverse AI models across heterogeneous hardware and environments. Targeting ML engineers and MLOps professionals, it streamlines inference serving by supporting multiple frameworks and optimizing performance for real-time, batched, and streaming workloads, enabling scalable AI deployment from cloud to edge.

How It Works

Triton employs a modular architecture to serve models from frameworks such as TensorRT, PyTorch, ONNX, and OpenVINO. It features concurrent model execution, dynamic batching, and sequence batching for stateful models. A key advantage is its Backend API, which allows custom operations and pre/post-processing logic, including Python-based backends (sketched below). Model pipelines are built via Ensembling or Business Logic Scripting (BLS), and clients communicate with the server over HTTP/REST and gRPC.
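
The Python backend is the lowest-friction way to add custom pre/post-processing. Below is a minimal sketch of a Python-backend model.py; the tensor names INPUT0 and OUTPUT0 are illustrative assumptions, standing in for whatever the model's config.pbtxt declares.

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Minimal Python backend: returns the input tensor doubled.

    Assumes the model's config.pbtxt declares an input "INPUT0" and an
    output "OUTPUT0" with compatible dtypes; both names are hypothetical.
    """

    def execute(self, requests):
        # Triton may hand several requests to one execute() call
        # (e.g., when dynamic batching is enabled).
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy() * 2)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses
```

The same class can also define initialize(args) and finalize() hooks for one-time model loading and cleanup.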

Quick Start & Requirements

Installation is recommended via Docker containers from NVIDIA NGC. Key prerequisites are Docker and, for accelerated performance, an NVIDIA GPU. A basic setup involves cloning the example models, launching the Triton server in a Docker container (nvcr.io/nvidia/tritonserver:25.08-py3), and sending inference requests with the provided client examples. CPU-only deployment is also documented. Resources include tutorials, a QuickStart guide, and NVIDIA LaunchPad labs.
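
Once the server is running, a request can be sent from Python with the tritonclient package (pip install tritonclient[http]). A minimal sketch, assuming a hypothetical model named my_model with one FP32 input INPUT0 and one output OUTPUT0 already present in the model repository:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a local Triton server on its default HTTP port (8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "my_model", "INPUT0", and "OUTPUT0" are hypothetical names; substitute
# the model and tensor names from your own model repository.
data = np.random.rand(1, 16).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(
    model_name="my_model",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
print(result.as_numpy("OUTPUT0"))
```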

Highlighted Details

  • Broad framework support: TensorRT, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more.
  • Extensibility via custom C/C++ or Python backends for custom operations.
  • Advanced features: Model Ensembling, Business Logic Scripting (BLS), dynamic batching, and sequence batching.
  • Optimized for diverse inference patterns: real-time, batched, ensembles, audio/video streaming.
  • Provides detailed metrics on GPU utilization, throughput, and latency (see the sketch after this list).
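
As a sketch of the metrics point above: Triton exposes Prometheus-format metrics over HTTP, by default on port 8002, so they can be scraped with any HTTP client. The metric-name prefixes below are assumptions based on Triton's documented naming and may vary by version.

```python
import urllib.request

# Triton serves Prometheus-format metrics on port 8002 by default.
with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    metrics = resp.read().decode("utf-8")

# Filter for GPU-utilization and request-latency lines; these metric
# names are assumptions and may differ across Triton versions.
for line in metrics.splitlines():
    if line.startswith(("nv_gpu_utilization", "nv_inference_request_duration")):
        print(line)
```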

Maintenance & Community

Triton is a core component of NVIDIA AI Enterprise, with enterprise support available. Community engagement happens through GitHub Discussions. Contributions follow the project's contribution guidelines, with a separate contrib repository for external additions such as backends and examples.

Licensing & Compatibility

The provided README does not specify the software license. Compatibility for commercial use or closed-source linking is therefore undetermined from this document.

Limitations & Caveats

The main branch tracks in-progress development and may be unstable. Support for specific backends varies across hardware platforms; consult the Backend-Platform Support Matrix. The absence of explicit licensing information is a significant caveat for adoption decisions.

Health Check

Last Commit: 1 day ago
Responsiveness: Inactive
Pull Requests (30d): 22
Issues (30d): 25
Star History: 131 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Top 0.1% on SourcePulse · 3k stars
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 4 years ago · Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Top 0.1% on SourcePulse · 6k stars
Unified text-to-text transformer for NLP research
Created 6 years ago · Updated 5 months ago