SDK for simplifying Triton Inference Server communication (C++, Python, Java)
This repository provides client libraries and examples for interacting with the Triton Inference Server. It offers C++, Python, and Java APIs to facilitate communication via HTTP/REST or gRPC, enabling inference, status checks, and model repository management. The libraries also support efficient data transfer using system and CUDA shared memory.
How It Works
The core of the project lies in its robust client libraries, which abstract the complexities of network communication (HTTP/REST and gRPC) with the Triton Inference Server. They provide convenient interfaces for sending inference requests, managing model lifecycles, and retrieving server status. A key advantage is the support for shared memory (system and CUDA), which bypasses data serialization/deserialization overhead for significant performance gains, especially with large inputs/outputs.
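As a rough illustration, the sketch below sends a single inference request over HTTP with the Python client. The model name "simple" and the tensor names INPUT0/OUTPUT0 are placeholders for whatever your model's configuration defines.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe one FP32 input tensor and attach the data to send.
data = np.ones((1, 16), dtype=np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Run inference and read back the requested output as a numpy array.
response = client.infer(model_name="simple", inputs=[infer_input])
print(response.as_numpy("OUTPUT0"))
```

The gRPC client (tritonclient.grpc) exposes an equivalent interface, so switching protocols is largely a matter of changing the import and the server port.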
Quick Start & Requirements
pip install tritonclient[all]
The [all] extra installs the HTTP/REST and gRPC clients along with CUDA shared memory support. The client SDK is also available from NGC as the pre-built Docker image nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk.
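To verify the installation against a running server, a minimal check might look like the following (assuming a local server on the default HTTP port 8000):

```python
import tritonclient.http as httpclient

# Assumes a Triton server is reachable at localhost:8000.
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())

# List the models known to the server's model repository.
print(client.get_model_repository_index())
```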
Highlighted Details
Maintenance & Community
The project is actively maintained by the Triton Inference Server team. Questions and issues can be reported on the main Triton issues page.
Licensing & Compatibility
The client libraries are distributed under the permissive BSD 3-Clause license (the same license as the server), allowing commercial use and integration into closed-source applications.
Limitations & Caveats
The Java API currently supports a limited feature subset. Python AsyncIO and Client Plugin API features are in Beta and subject to change. When using CUDA shared memory with Docker, containers must be run with the --pid host flag.
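For reference, a minimal sketch of the CUDA shared memory path with the Python HTTP client is shown below. The model name "simple", tensor name "INPUT0", and region name "input0_data" are placeholders, and the snippet assumes GPU 0 and a server on localhost:8000; when client and server run in separate Docker containers, they must share the host PID namespace as noted above.

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient(url="localhost:8000")

# Allocate a CUDA shared memory region on GPU 0 and copy the input into it.
input_data = np.ones((1, 16), dtype=np.float32)
byte_size = input_data.nbytes
shm_handle = cudashm.create_shared_memory_region("input0_data", byte_size, 0)
cudashm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with the server, then point the request's input at it
# instead of sending the tensor bytes over the network.
client.register_cuda_shared_memory(
    "input0_data", cudashm.get_raw_handle(shm_handle), 0, byte_size
)
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_shared_memory("input0_data", byte_size)
response = client.infer(model_name="simple", inputs=[infer_input])

# Unregister and free the region when done.
client.unregister_cuda_shared_memory("input0_data")
cudashm.destroy_shared_memory_region(shm_handle)
```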