transformer-deploy by ELS-RD

CLI tool for optimized Hugging Face Transformer deployment

Created 3 years ago
1,689 stars

Top 25.1% on SourcePulse

Project Summary

This project provides a CLI tool for enterprise-grade inference of Hugging Face Transformer models, targeting engineers and researchers who need to deploy NLP models efficiently. It delivers significant latency improvements (up to 10x) by optimizing models for CPU and GPU inference with ONNX Runtime and NVIDIA TensorRT, and it integrates with the NVIDIA Triton Inference Server for serving.

How It Works

The core approach involves converting Hugging Face Transformer models into optimized ONNX or TensorRT formats. This conversion process, handled by a single command-line tool, applies optimizations like kernel fusion and mixed precision. The optimized models are then deployed via the NVIDIA Triton Inference Server, which provides a robust, scalable, and production-ready serving solution. This combination leverages the performance benefits of TensorRT and ONNX Runtime over standard PyTorch/FastAPI deployments.
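As a sketch of that workflow: the README documents a `convert_model` command run inside the project's Docker image. The model name, sequence lengths, and batch sizes below are illustrative placeholders, and exact flags may differ across versions.

```shell
# Run the conversion inside the project's Docker image (GPU required for TensorRT).
# Model name, sequence lengths, and batch sizes are illustrative; adjust for your workload.
docker run -it --rm --gpus all \
  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
  bash -c "cd /project && \
    convert_model -m \"philschmid/MiniLM-L6-H384-uncased-sst2\" \
      --backend tensorrt onnx \
      --seq-len 16 128 128 \
      --batch-size 1 32 32"
```

The three values given to `--seq-len` and `--batch-size` are the min/optimal/max shapes TensorRT uses to build its optimization profiles; the command produces a Triton-compatible model repository (the README shows a `triton_models/` directory).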

Quick Start & Requirements

  • Install/Run: Primarily uses Docker images (ghcr.io/els-rd/transformer-deploy:0.6.0).
  • Prerequisites: For GPU usage, NVIDIA drivers and NVIDIA Container Toolkit are required.
  • Setup: Clone the repository and pull the Docker image. Detailed examples cover classification, token classification, feature extraction, and text generation.
  • Links: Demo Notebooks
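To serve a converted model, the README's examples launch NVIDIA Triton over the generated model repository. A minimal sketch, assuming the default `triton_models/` output directory; the Triton image tag below is an example and should match your CUDA/driver stack:

```shell
# Serve the generated model repository with Triton.
# Image tag is illustrative; pick one compatible with your NVIDIA drivers.
docker run -it --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $PWD/triton_models:/models \
  nvcr.io/nvidia/tritonserver:22.07-py3 \
  bash -c "pip install transformers && tritonserver --model-repository=/models"
```

Ports 8000, 8001, and 8002 are Triton's defaults for HTTP, gRPC, and Prometheus metrics; `transformers` is installed inside the container because the generated repository includes a Python-backend tokenization model.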

Highlighted Details

  • Achieves 5x-10x speedup over vanilla PyTorch for Transformer inference.
  • Supports deployment to NVIDIA Triton Inference Server.
  • Offers quantization support for both CPU and GPU.
  • Handles various tasks: document classification, token classification (NER), feature extraction, text generation.
  • Models must be exportable to ONNX format.

Maintenance & Community

  • The project is maintained by ELS-RD.
  • Further details on community or roadmap are not explicitly provided in the README.

Licensing & Compatibility

  • The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

  • For GPU inference, NVIDIA hardware and software stack are mandatory.
  • Some optimizations, particularly for large models or specific tasks like text generation, may require careful parameter tuning (e.g., absolute tolerance) and can be time-consuming.
  • The README claims PyTorch is "never competitive" for Transformer inference, reflecting the project's strong preference for ONNX/TensorRT.
Health Check

  • Last Commit: 11 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai

0%
309
Framework for large-scale transformer optimization
Created 3 years ago
Updated 3 years ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

0%
790
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Tobi Lutke (Cofounder of Shopify), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 11 more.

ctransformers by marella

0.1%
2k
Python bindings for fast Transformer model inference
Created 2 years ago
Updated 1 year ago
Starred by Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI) and Cody Yu (Coauthor of vLLM; MTS at OpenAI).

xDiT by xdit-project

0.7%
2k
Inference engine for parallel Diffusion Transformer (DiT) deployment
Created 1 year ago
Updated 1 day ago
Starred by Luis Capelo (Cofounder of Lightning AI), Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), and 7 more.

TransformerEngine by NVIDIA

0.4%
3k
Library for Transformer model acceleration on NVIDIA GPUs
Created 3 years ago
Updated 18 hours ago
Starred by Luis Capelo (Cofounder of Lightning AI), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 4 more.

ktransformers by kvcache-ai

0.3%
15k
Framework for LLM inference optimization experimentation
Created 1 year ago
Updated 2 days ago