client by triton-inference-server

SDK for simplifying Triton inference server communication (C++, Python, Java)

created 4 years ago
636 stars

Top 53.1% on sourcepulse

View on GitHub
Project Summary

This repository provides client libraries and examples for interacting with the Triton Inference Server. It offers C++, Python, and Java APIs to facilitate communication via HTTP/REST or gRPC, enabling inference, status checks, and model repository management. The libraries also support efficient data transfer using system and CUDA shared memory.

How It Works

The client libraries abstract HTTP/REST and gRPC communication with the Triton Inference Server behind convenient interfaces for sending inference requests, managing model lifecycles, and retrieving server status. A key advantage is support for system and CUDA shared memory, which bypasses serialization/deserialization overhead and yields significant performance gains, especially for large inputs and outputs.
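
For illustration, a synchronous inference call through the Python HTTP client might look like the following sketch; the model name "densenet_onnx" and tensor names "data_0"/"fc6_1" are placeholders that must match the model deployed on your server:

    import numpy as np
    import tritonclient.http as httpclient

    # Connect to a Triton server listening on the default HTTP port.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Describe and fill a single input tensor.
    batch = np.zeros((1, 3, 224, 224), dtype=np.float32)
    infer_input = httpclient.InferInput("data_0", list(batch.shape), "FP32")
    infer_input.set_data_from_numpy(batch)

    # Ask for one output tensor back.
    requested = httpclient.InferRequestedOutput("fc6_1")

    # Run inference and read the result as a NumPy array.
    response = client.infer("densenet_onnx", inputs=[infer_input], outputs=[requested])
    print(response.as_numpy("fc6_1").shape)

The gRPC client (tritonclient.grpc) exposes an equivalent interface, so switching protocols is mostly a matter of changing the imported module and the port.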

Quick Start & Requirements

  • Python: pip install tritonclient[all] (installs HTTP/REST, gRPC, and CUDA shared memory support); a minimal connectivity check is sketched after this list.
  • C++/Java: Download pre-built libraries from GitHub releases or build from source using CMake.
  • Docker: Pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk from NGC.
  • Prerequisites: Python 3.x, CMake, Maven/JDK (for Java client build), Docker (for NGC image). CUDA shared memory requires a compatible NVIDIA driver.
  • Links: Triton Client Libraries
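
After installing the Python package, a minimal connectivity and readiness check against a local server might look like this sketch (default ports assumed; "densenet_onnx" is a placeholder model name):

    import tritonclient.grpc as grpcclient

    # Default Triton gRPC port is 8001 (HTTP/REST uses 8000).
    client = grpcclient.InferenceServerClient(url="localhost:8001")

    # Liveness/readiness probes and metadata queries.
    print("server live: ", client.is_server_live())
    print("server ready:", client.is_server_ready())
    print("model ready: ", client.is_model_ready("densenet_onnx"))
    print(client.get_model_metadata("densenet_onnx"))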

Highlighted Details

  • Supports HTTP/REST and gRPC protocols.
  • Enables efficient data transfer via System Shared Memory and CUDA Shared Memory; a system shared memory sketch follows this list.
  • Includes example applications for image classification and ensemble models.
  • Offers Python AsyncIO support (Beta) and a Client Plugin API (Beta) for custom request header manipulation.
  • Supports ORCA Header Metrics for KV-cache utilization and capacity.
  • Provides client-side compression and SSL/TLS configuration.
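
The system shared memory path registers a named region with the server and points inputs at it instead of sending tensor bytes over the wire. A minimal sketch, assuming the tritonclient[all] install; the model name "my_model" and tensor names "INPUT0"/"OUTPUT0" are placeholders:

    import numpy as np
    import tritonclient.http as httpclient
    import tritonclient.utils.shared_memory as shm
    from tritonclient.utils import np_to_triton_dtype

    client = httpclient.InferenceServerClient(url="localhost:8000")

    data = np.arange(16, dtype=np.int32).reshape(1, 16)
    byte_size = data.size * data.itemsize

    # Create a system shared memory region, copy the input into it, and
    # register it with the server so the data can be read in place.
    handle = shm.create_shared_memory_region("input_data", "/input_region", byte_size)
    shm.set_shared_memory_region(handle, [data])
    client.register_system_shared_memory("input_data", "/input_region", byte_size)

    # Point the input at the region instead of attaching the tensor bytes.
    infer_input = httpclient.InferInput("INPUT0", list(data.shape), np_to_triton_dtype(data.dtype))
    infer_input.set_shared_memory("input_data", byte_size)

    response = client.infer("my_model", inputs=[infer_input])
    print(response.as_numpy("OUTPUT0"))

    # Unregister on the server and release the local region.
    client.unregister_system_shared_memory("input_data")
    shm.destroy_shared_memory_region(handle)

Output tensors can be mapped the same way via InferRequestedOutput.set_shared_memory, and the CUDA variant follows the same pattern through tritonclient.utils.cuda_shared_memory.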

Maintenance & Community

The project is actively maintained by the Triton Inference Server team. Questions and issues can be reported on the main Triton issues page.

Licensing & Compatibility

The client libraries are distributed under a permissive BSD 3-Clause license (the same license used by the Triton server), allowing commercial use and integration into closed-source applications.

Limitations & Caveats

The Java API currently supports a limited feature subset. Python AsyncIO and Client Plugin API features are in Beta and subject to change. When using CUDA shared memory with Docker, the --pid host flag is required for containers.
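
For reference, the Beta AsyncIO path mirrors the synchronous client; a minimal sketch, with the same placeholder model and tensor names used above:

    import asyncio
    import numpy as np
    import tritonclient.http.aio as aiohttpclient

    async def main():
        client = aiohttpclient.InferenceServerClient(url="localhost:8000")
        try:
            data = np.zeros((1, 3, 224, 224), dtype=np.float32)
            infer_input = aiohttpclient.InferInput("data_0", list(data.shape), "FP32")
            infer_input.set_data_from_numpy(data)
            response = await client.infer("densenet_onnx", inputs=[infer_input])
            print(response.as_numpy("fc6_1").shape)
        finally:
            # The AsyncIO client holds an open connection that must be closed explicitly.
            await client.close()

    asyncio.run(main())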

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 6
  • Issues (30d): 1
  • Star History: 20 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 2 more.

serve by pytorch

Top 0.1% on sourcepulse · 4k stars
Serve, optimize, and scale PyTorch models in production
created 5 years ago · updated 3 weeks ago