pytriton by triton-inference-server

Python framework for deploying ML models with Triton Inference Server

Created 2 years ago
823 stars

Top 43.2% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

PyTriton addresses the challenge of deploying machine learning models using NVIDIA's Triton Inference Server within Python-centric workflows. It offers a Flask/FastAPI-like framework, enabling developers to serve models directly from Python code with ease. This simplifies the integration of Triton into existing Python applications and ML pipelines, providing a familiar interface while leveraging Triton's high-performance inference capabilities.

How It Works

PyTriton acts as a Pythonic wrapper around the Triton Inference Server, abstracting away much of the complexity. It allows users to define inference logic using standard Python functions, which can then be exposed as HTTP or gRPC APIs. The framework is agnostic to underlying ML libraries, supporting popular choices like PyTorch, TensorFlow, and JAX. Key performance features like dynamic batching, response caching, and model pipelining are accessible through decorators and configurations, aiming to maximize throughput and minimize latency.
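For illustration, here is a minimal sketch of that flow, following the bind-and-serve pattern from the PyTriton documentation. The model name "Doubler", the tensor names, and the doubling logic are placeholders, not part of the library:

    import numpy as np

    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    from pytriton.triton import Triton

    @batch  # collects individual requests into batched numpy arrays
    def infer_fn(data):
        # Placeholder logic; a real deployment would call a PyTorch,
        # TensorFlow, or JAX model here.
        return {"result": data * 2.0}

    with Triton() as triton:
        triton.bind(
            model_name="Doubler",  # illustrative model name
            infer_func=infer_fn,
            inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
            outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
            config=ModelConfig(max_batch_size=128),  # enables dynamic batching
        )
        triton.serve()  # blocks and serves requests

Once serve() is running, the bound function is reachable through Triton's standard HTTP and gRPC inference endpoints.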

Quick Start & Requirements

Installation is straightforward via pip: pip install nvidia-pytriton. Prerequisites include Python 3.8+, pip 20.3+, and a compatible operating system with glibc version 2.35 or higher (tested on Ubuntu 22.04, also supports Debian 11+, Rocky Linux 9+, UBI 9+). Ensure libpython3.*.so is installed. Detailed installation, Docker usage, and building from source instructions are available in the documentation.
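One way to sanity-check the glibc requirement before installing (an illustrative snippet using only Python's standard library; it is not part of PyTriton):

    import platform

    # PyTriton's wheels require glibc 2.35+ (Ubuntu 22.04 and newer ship it)
    libc, version = platform.libc_ver()
    if libc == "glibc":
        major, minor = (int(part) for part in version.split(".")[:2])
        print("glibc", version, "- OK" if (major, minor) >= (2, 35) else "- too old")
    else:
        print("glibc not detected; the PyTriton wheels are unlikely to work here")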

Highlighted Details

  • Native Python Deployment: Expose any Python function as an HTTP/gRPC API endpoint.
  • Framework Agnostic: Seamlessly deploy models from PyTorch, TensorFlow, JAX, and other Python frameworks.
  • Performance Optimizations: Utilizes Triton's dynamic batching, response caching, model pipelining, and performance tracing.
  • Developer Experience: Features decorators that simplify batching and pre-processing, alongside high-level model clients for synchronous and asynchronous requests (see the client sketch after this list).
  • Streaming (Alpha): Supports streaming partial responses in a decoupled mode.
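
The high-level clients mentioned above are exposed in pytriton.client. Below is a minimal synchronous sketch against the hypothetical "Doubler" model from the earlier example, assuming Triton's default HTTP port 8000:

    import numpy as np

    from pytriton.client import ModelClient

    # A batch of two samples, matching the (-1,) sample shape bound above
    batch = np.array([[1.0], [2.0]], dtype=np.float32)

    with ModelClient("localhost:8000", "Doubler") as client:
        result = client.infer_batch(data=batch)

    print(result["result"])  # expected: [[2.0], [4.0]]

For the asynchronous case, the package also exposes a corresponding asyncio-based client.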

Maintenance & Community

The provided README does not detail specific community channels (like Discord/Slack), notable contributors, sponsorships, or a public roadmap. Links to "Contributing" and "Known Issues" are mentioned but not provided.

Licensing & Compatibility

The licensing information is not specified in the provided README content.

Limitations & Caveats

The streaming functionality is currently in an alpha state. Operating system compatibility requires specific glibc versions and distributions, with Ubuntu 22.04 being the primary tested environment.

Health Check

  • Last Commit: 2 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

7 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Top 0.1% on SourcePulse
3k stars
Collaborative benchmark for probing and extrapolating LLM capabilities
Created 4 years ago
Updated 1 year ago

Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Top 0.1% on SourcePulse
6k stars
Unified text-to-text transformer for NLP research
Created 6 years ago
Updated 5 months ago