transformer-deploy by ELS-RD

CLI tool for optimized Hugging Face Transformer deployment

created 3 years ago
1,688 stars

Top 25.7% on sourcepulse

Project Summary

This project provides an enterprise-grade inference server for Hugging Face Transformer models, targeting engineers and researchers who need to deploy NLP models efficiently. It offers significant latency improvements (up to 10x faster) by optimizing models for CPU and GPU inference using ONNX Runtime and NVIDIA TensorRT, and integrates seamlessly with the NVIDIA Triton Inference Server.

How It Works

The core approach involves converting Hugging Face Transformer models into optimized ONNX or TensorRT formats. This conversion process, handled by a single command-line tool, applies optimizations like kernel fusion and mixed precision. The optimized models are then deployed via the NVIDIA Triton Inference Server, which provides a robust, scalable, and production-ready serving solution. This combination leverages the performance benefits of TensorRT and ONNX Runtime over standard PyTorch/FastAPI deployments.
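
As a rough illustration of the kind of conversion the CLI automates, the sketch below exports a Hugging Face model to ONNX and runs it with ONNX Runtime. The model name, opset version, and tensor names are illustrative assumptions, not taken from the project, and the project's own tool applies further optimizations (kernel fusion, mixed precision) on top of a plain export like this.

```python
# Minimal sketch: export a Hugging Face classifier to ONNX and run it with
# ONNX Runtime. Model name, opset, and tensor names are illustrative only.
import torch
import onnxruntime as ort
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=False).eval()

encoded = tokenizer("ONNX export demo", return_tensors="pt")

# Export with dynamic batch and sequence axes so the graph accepts any input size.
torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# Run the exported graph; swapping in the CUDA or TensorRT execution provider
# is where the GPU speedups described above come from.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(
    ["logits"],
    {
        "input_ids": encoded["input_ids"].numpy(),
        "attention_mask": encoded["attention_mask"].numpy(),
    },
)[0]
print(logits)
```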

Quick Start & Requirements

  • Install/Run: Primarily uses Docker images (ghcr.io/els-rd/transformer-deploy:0.6.0).
  • Prerequisites: For GPU usage, NVIDIA drivers and NVIDIA Container Toolkit are required.
  • Setup: Cloning the repository and pulling the Docker image are the initial steps. Detailed examples are provided for classification, token classification, feature extraction, and text generation; a minimal Triton client sketch follows this list.
  • Links: Demo Notebooks
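
Once a model has been converted and deployed, Triton exposes it over HTTP/gRPC. Below is a minimal client sketch, assuming a server on localhost:8000 serving a text model named "transformer_onnx_inference" with a BYTES input called "TEXT" and an output called "output"; all of these names are hypothetical and depend on the Triton configuration generated for your model.

```python
# Minimal Triton HTTP client sketch. Server URL, model name, and tensor names
# ("TEXT", "output") are hypothetical; check the generated Triton config.
import numpy as np
import tritonclient.http as triton_http

client = triton_http.InferenceServerClient(url="localhost:8000")

# Assume the deployed pipeline tokenizes server-side and takes raw text as BYTES.
text = np.array([["This library makes deployment easy."]], dtype=object)
infer_input = triton_http.InferInput("TEXT", list(text.shape), "BYTES")
infer_input.set_data_from_numpy(text)

response = client.infer(model_name="transformer_onnx_inference", inputs=[infer_input])
print(response.as_numpy("output"))  # hypothetical output tensor name
```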

Highlighted Details

  • Reports a 5x-10x speedup over vanilla PyTorch for Transformer inference.
  • Supports deployment to NVIDIA Triton Inference Server.
  • Offers quantization support for both CPU and GPU (a CPU-side sketch follows this list).
  • Handles various tasks: document classification, token classification (NER), feature extraction, text generation.
  • Models must be exportable to ONNX format.
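
For the CPU side of the quantization bullet above, here is a minimal sketch using ONNX Runtime's dynamic quantization on an already-exported model; the project ships its own quantization workflow, so this only shows the general idea, and the file paths are placeholders.

```python
# Dynamic (post-training) quantization of an exported ONNX model with ONNX Runtime.
# Paths are placeholders; GPU INT8 quantization follows a different (TensorRT) path.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",        # previously exported FP32 model
    model_output="model-int8.onnx",  # weights stored as INT8
    weight_type=QuantType.QInt8,
)
```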

Maintenance & Community

  • The project is maintained by ELS-RD.
  • Further details on community or roadmap are not explicitly provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • For GPU inference, NVIDIA hardware and software stack are mandatory.
  • Some optimizations, particularly for large models or specific tasks like text generation, may require careful parameter tuning (e.g., absolute tolerance) and can be time-consuming; a tolerance-check sketch follows this list.
  • The README notes that PyTorch is "never competitive" for Transformer inference, highlighting a strong bias towards ONNX/TensorRT.
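
A minimal sketch of the tolerance check the tuning caveat refers to: compare baseline PyTorch logits with the optimized engine's logits under an explicit absolute tolerance. The arrays below are placeholder values standing in for real outputs from the two backends.

```python
# Placeholder comparison of PyTorch vs. optimized-engine outputs under a chosen atol.
import numpy as np

pytorch_logits = np.array([[-1.203, 1.187]], dtype=np.float32)    # placeholder values
optimized_logits = np.array([[-1.201, 1.185]], dtype=np.float32)  # placeholder values

atol = 1e-1  # FP16/TensorRT outputs usually need a looser tolerance than FP32
if not np.allclose(pytorch_logits, optimized_logits, atol=atol):
    raise ValueError("Outputs diverge beyond atol; loosen the tolerance or disable some optimizations.")
```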

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 10 stars in the last 90 days
