transformer-deploy by ELS-RD

CLI tool for optimized Hugging Face Transformer deployment

created 3 years ago
1,688 stars

Top 25.7% on sourcepulse

Project Summary

This project provides an enterprise-grade inference server for Hugging Face Transformer models, targeting engineers and researchers who need to deploy NLP models efficiently. It offers significant latency improvements (up to 10x faster) by optimizing models for CPU and GPU inference using ONNX Runtime and NVIDIA TensorRT, and integrates seamlessly with the NVIDIA Triton Inference Server.

How It Works

The core approach involves converting Hugging Face Transformer models into optimized ONNX or TensorRT formats. This conversion process, handled by a single command-line tool, applies optimizations like kernel fusion and mixed precision. The optimized models are then deployed via the NVIDIA Triton Inference Server, which provides a robust, scalable, and production-ready serving solution. This combination leverages the performance benefits of TensorRT and ONNX Runtime over standard PyTorch/FastAPI deployments.
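
As a rough illustration of the kind of conversion the CLI automates, the sketch below exports a Hugging Face model to ONNX and runs it with ONNX Runtime. The model name, opset version, and tensor names are illustrative assumptions, not taken from the project, and the project's own tool applies further optimizations (kernel fusion, mixed precision) on top of a plain export like this.

```python
# Minimal sketch: export a Hugging Face classifier to ONNX and run it with
# ONNX Runtime. Model name, opset, and tensor names are illustrative only.
import torch
import onnxruntime as ort
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False).eval()

encoded = tokenizer("ONNX export demo", return_tensors="pt")

# Export with dynamic batch and sequence axes so the graph accepts any input size.
torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# Run the exported graph; swapping in the CUDA or TensorRT execution provider
# is where the GPU speedups described above come from.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(
    ["logits"],
    {
        "input_ids": encoded["input_ids"].numpy(),
        "attention_mask": encoded["attention_mask"].numpy(),
    },
)[0]
print(logits)
```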

Quick Start & Requirements

  • Install/Run: Primarily uses Docker images (ghcr.io/els-rd/transformer-deploy:0.6.0).
  • Prerequisites: For GPU usage, NVIDIA drivers and NVIDIA Container Toolkit are required.
  • Setup: Cloning the repository and pulling the Docker image are the initial steps. Detailed examples are provided for classification, token classification, feature extraction, and text generation; a minimal Triton client sketch follows this list.
  • Links: Demo Notebooks
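
Once a model has been converted and deployed, Triton exposes it over HTTP/gRPC. Below is a minimal client sketch, assuming a server on localhost:8000 serving a text model named "transformer_onnx_inference" with a BYTES input called "TEXT" and an output called "output"; all of these names are hypothetical and depend on the Triton configuration generated for your model.

```python
# Minimal Triton HTTP client sketch. Server URL, model name, and tensor names
# ("TEXT", "output") are hypothetical; check the generated Triton config.
import numpy as np
import tritonclient.http as triton_http

client = triton_http.InferenceServerClient(url="localhost:8000")

# Assume the deployed pipeline tokenizes server-side and takes raw text as BYTES.
text = np.array([["This library makes deployment easy."]], dtype=object)
infer_input = triton_http.InferInput("TEXT", list(text.shape), "BYTES")
infer_input.set_data_from_numpy(text)

response = client.infer(model_name="transformer_onnx_inference", inputs=[infer_input])
print(response.as_numpy("output"))  # hypothetical output tensor name
```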

Highlighted Details

  • Reports a 5x-10x speedup over vanilla PyTorch for Transformer inference.
  • Supports deployment to NVIDIA Triton Inference Server.
  • Offers quantization support for both CPU and GPU (a CPU-side sketch follows this list).
  • Handles various tasks: document classification, token classification (NER), feature extraction, text generation.
  • Models must be exportable to ONNX format.
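
For the CPU side of the quantization bullet above, here is a minimal sketch using ONNX Runtime's dynamic quantization on an already-exported model; the project ships its own quantization workflow, so this only shows the general idea, and the file paths are placeholders.

```python
# Dynamic (post-training) quantization of an exported ONNX model with ONNX Runtime.
# Paths are placeholders; GPU INT8 quantization follows a different (TensorRT) path.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",        # previously exported FP32 model
    model_output="model-int8.onnx",  # weights stored as INT8
    weight_type=QuantType.QInt8,
)
```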

Maintenance & Community

  • The project is maintained by ELS-RD.
  • Further details on community or roadmap are not explicitly provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • For GPU inference, NVIDIA hardware and software stack are mandatory.
  • Some optimizations, particularly for large models or specific tasks like text generation, may require careful parameter tuning (e.g., absolute tolerance) and can be time-consuming; a tolerance-check sketch follows this list.
  • The README notes that PyTorch is "never competitive" for Transformer inference, highlighting a strong bias towards ONNX/TensorRT.
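
A minimal sketch of the tolerance check the tuning caveat refers to: compare baseline PyTorch logits with the optimized engine's logits under an explicit absolute tolerance. The arrays below are placeholder values standing in for real outputs from the two backends.

```python
# Placeholder comparison of PyTorch vs. optimized-engine outputs under a chosen atol.
import numpy as np

pytorch_logits = np.array([[-1.203, 1.187]], dtype=np.float32)    # placeholder values
optimized_logits = np.array([[-1.201, 1.185]], dtype=np.float32)  # placeholder values

atol = 1e-1  # FP16/TensorRT outputs usually need a looser tolerance than FP32
if not np.allclose(pytorch_logits, optimized_logits, atol=atol):
    raise ValueError("Outputs diverge beyond atol; loosen the tolerance or disable some optimizations.")
```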

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 90 days
