pytriton by triton-inference-server

Python framework for deploying ML models with Triton Inference Server

Created 2 years ago
823 stars

Top 43.2% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

PyTriton addresses the challenge of deploying machine learning models using NVIDIA's Triton Inference Server within Python-centric workflows. It offers a Flask/FastAPI-like framework, enabling developers to serve models directly from Python code with ease. This simplifies the integration of Triton into existing Python applications and ML pipelines, providing a familiar interface while leveraging Triton's high-performance inference capabilities.

How It Works

PyTriton acts as a Pythonic wrapper around the Triton Inference Server, abstracting away much of the complexity. It allows users to define inference logic using standard Python functions, which can then be exposed as HTTP or gRPC APIs. The framework is agnostic to underlying ML libraries, supporting popular choices like PyTorch, TensorFlow, and JAX. Key performance features like dynamic batching, response caching, and model pipelining are accessible through decorators and configurations, aiming to maximize throughput and minimize latency.
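For illustration, here is a minimal sketch of that flow, following the bind-and-serve pattern from the PyTriton documentation. The model name "Doubler", the tensor names, and the doubling logic are placeholders, not part of the library:

    import numpy as np

    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    from pytriton.triton import Triton

    @batch  # collects individual requests into batched numpy arrays
    def infer_fn(data):
        # Placeholder logic; a real deployment would call a PyTorch,
        # TensorFlow, or JAX model here.
        return {"result": data * 2.0}

    with Triton() as triton:
        triton.bind(
            model_name="Doubler",  # illustrative model name
            infer_func=infer_fn,
            inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
            outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
            config=ModelConfig(max_batch_size=128),  # enables dynamic batching
        )
        triton.serve()  # blocks and serves requests

Once serve() is running, the bound function is reachable through Triton's standard HTTP and gRPC inference endpoints.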

Quick Start & Requirements

Installation is straightforward via pip: pip install nvidia-pytriton. Prerequisites include Python 3.8+, pip 20.3+, and a compatible operating system with glibc version 2.35 or higher (tested on Ubuntu 22.04, also supports Debian 11+, Rocky Linux 9+, UBI 9+). Ensure libpython3.*.so is installed. Detailed installation, Docker usage, and building from source instructions are available in the documentation.
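One way to sanity-check the glibc requirement before installing (an illustrative snippet using only Python's standard library; it is not part of PyTriton):

    import platform

    # PyTriton's wheels require glibc 2.35+ (Ubuntu 22.04 and newer ship it)
    libc, version = platform.libc_ver()
    if libc == "glibc":
        major, minor = (int(part) for part in version.split(".")[:2])
        print("glibc", version, "- OK" if (major, minor) >= (2, 35) else "- too old")
    else:
        print("glibc not detected; the PyTriton wheels are unlikely to work here")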

Highlighted Details

  • Native Python Deployment: Expose any Python function as an HTTP/gRPC API endpoint.
  • Framework Agnostic: Seamlessly deploy models from PyTorch, TensorFlow, JAX, and other Python frameworks.
  • Performance Optimizations: Utilizes Triton's dynamic batching, response caching, model pipelining, and performance tracing.
  • Developer Experience: Features decorators that simplify batching and pre-processing, alongside high-level model clients for synchronous and asynchronous requests (see the client sketch after this list).
  • Streaming (Alpha): Supports streaming partial responses in a decoupled mode.
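
The high-level clients mentioned above are exposed in pytriton.client. Below is a minimal synchronous sketch against the hypothetical "Doubler" model from the earlier example, assuming Triton's default HTTP port 8000:

    import numpy as np

    from pytriton.client import ModelClient

    # A batch of two samples, matching the (-1,) sample shape bound above
    batch = np.array([[1.0], [2.0]], dtype=np.float32)

    with ModelClient("localhost:8000", "Doubler") as client:
        result = client.infer_batch(data=batch)

    print(result["result"])  # expected: [[2.0], [4.0]]

For the asynchronous case, the package also exposes a corresponding asyncio-based client.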

Maintenance & Community

The provided README does not detail specific community channels (like Discord/Slack), notable contributors, sponsorships, or a public roadmap. Links to "Contributing" and "Known Issues" are mentioned but not provided.

Licensing & Compatibility

The licensing information is not specified in the provided README content.

Limitations & Caveats

The streaming functionality is currently in an alpha state. Operating system compatibility requires specific glibc versions and distributions, with Ubuntu 22.04 being the primary tested environment.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

7 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Top 0.1% on SourcePulse
3k stars
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 4 years ago
Updated 1 year ago

Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Top 0.1% on SourcePulse
6k stars
Unified text-to-text transformer for NLP research
Created 6 years ago
Updated 5 months ago