fastertransformer_backend  by triton-inference-server

Triton backend for optimized transformer inference

created 4 years ago
411 stars

Top 72.2% on sourcepulse

Project Summary

This repository provides a Triton Inference Server backend for NVIDIA's FasterTransformer library, enabling highly optimized inference for large language models like GPT-3, T5, and BERT. It targets researchers and engineers needing to deploy these models efficiently across multi-GPU and multi-node setups, offering significant performance gains through optimized attention mechanisms and parallelization strategies.

How It Works

The backend integrates FasterTransformer's optimized CUDA kernels for transformer layers into Triton. It addresses the cost of auto-regressive decoding by managing key-value caches inside the backend, so past tokens are not recomputed at each generation step. For multi-GPU and multi-node deployment, it uses MPI for inter-node communication and multi-threading to drive the GPUs within a node, supporting both tensor and pipeline parallelism.
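As a rough client-side illustration of this flow, the sketch below sends one generation request through Triton's Python HTTP client. The model name ("fastertransformer") and the tensor names ("input_ids", "input_lengths", "request_output_len", "output_ids") are assumptions taken from the repository's GPT example and may differ for other models; the deployed model's config.pbtxt is authoritative.

```python
# Minimal client-side sketch, assuming a GPT-style model is served under the
# name "fastertransformer" and that tensor names/dtypes follow the repo's GPT
# example. Adjust to the actual config.pbtxt of the deployed model.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Token IDs for a single request (batch size 1); tokenization happens client-side.
input_ids = np.array([[818, 262, 3504, 286]], dtype=np.uint32)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
request_output_len = np.array([[32]], dtype=np.uint32)

inputs = []
for name, data in [
    ("input_ids", input_ids),
    ("input_lengths", input_lengths),
    ("request_output_len", request_output_len),
]:
    tensor = httpclient.InferInput(name, data.shape, "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

outputs = [httpclient.InferRequestedOutput("output_ids")]

# The backend runs the auto-regressive loop server-side, reusing its internal
# key-value cache rather than receiving the full past context each step.
result = client.infer("fastertransformer", inputs, outputs=outputs)
print(result.as_numpy("output_ids"))
```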

Quick Start & Requirements

  • Install/Run: Requires building a custom Triton Docker image with FasterTransformer integrated. The docker/create_dockerfile_and_build.py script or manual docker build commands are provided.
  • Prerequisites: NVIDIA GPUs, CUDA, Docker. Specific Triton and FasterTransformer versions are tied to the CONTAINER_VERSION (e.g., 23.04).
  • Setup: Building the Docker image is the primary setup step; a quick client-side check that the server and model came up is sketched after this list.
  • Docs: Triton Backend Repo
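Once the image is running with a model repository loaded, a minimal way to verify the deployment from Python is Triton's client API. The endpoint (localhost:8000) and model name (fastertransformer) below are placeholder assumptions, not values fixed by the repository.

```python
# Minimal liveness/readiness probe, assuming Triton's HTTP endpoint is exposed
# on localhost:8000 and the model is registered as "fastertransformer".
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("fastertransformer"))

# The model metadata lists the input/output tensors declared in config.pbtxt,
# which is handy for confirming the expected tensor names and datatypes.
print(client.get_model_metadata("fastertransformer"))
```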

Highlighted Details

  • Supports FP16 and BF16 precision for various models including GPT, T5, BERT, BLOOM, GPT-J, and GPT-NeoX.
  • Enables tensor and pipeline parallelism for scaling inference across multiple GPUs and nodes (a configuration sketch follows this list).
  • Allows multiple model instances on the same GPUs, sharing weights to optimize memory usage.
  • Offers flexibility in configuring GPU topology and NCCL communication modes (GROUP vs. PARALLEL).
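The parallelism and instance-sharing knobs above are set per model in Triton's config.pbtxt. The sketch below writes a minimal config from Python for illustration only; the parameter names (model_type, model_checkpoint_path, data_type, tensor_para_size, pipeline_para_size) and the KIND_CPU instance group follow the repository's GPT example and should be checked against the example configs shipped with the repo.

```python
# Illustrative only: emits a minimal config.pbtxt for a GPT-style model.
# Paths and values are placeholders; parameter names are assumed to match the
# repository's GPT example config.
from pathlib import Path

config = '''
name: "fastertransformer"
backend: "fastertransformer"
max_batch_size: 8

# Two instances on the same GPUs share weights; the backend manages the GPUs
# itself, so instances are declared on KIND_CPU.
instance_group [ { count: 2, kind: KIND_CPU } ]

parameters { key: "model_type"            value: { string_value: "GPT" } }
parameters { key: "model_checkpoint_path" value: { string_value: "/models/gpt/2-gpu" } }
parameters { key: "data_type"             value: { string_value: "fp16" } }
parameters { key: "tensor_para_size"      value: { string_value: "2" } }
parameters { key: "pipeline_para_size"    value: { string_value: "1" } }
'''

Path("triton-model-store/fastertransformer").mkdir(parents=True, exist_ok=True)
Path("triton-model-store/fastertransformer/config.pbtxt").write_text(config)
```

In the repository's examples, tensor_para_size multiplied by pipeline_para_size matches the number of GPUs the checkpoint was converted for.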

Maintenance & Community

Development has transitioned to TensorRT-LLM. This repository is noted as a research and prototyping tool, not a formally maintained product. Questions and issues should be reported on the issues page.

Licensing & Compatibility

The repository itself appears to be Apache 2.0 licensed, but it integrates FasterTransformer, which is typically subject to NVIDIA's licensing terms. Compatibility for commercial use depends on the underlying FasterTransformer license.

Limitations & Caveats

The project is in a research/prototyping phase and is no longer under active development, with all efforts redirected to TensorRT-LLM. It explicitly states it is not a formal product or maintained framework.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 90 days
Explore Similar Projects

Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

FasterTransformer by NVIDIA

Optimized transformer library for inference

created 4 years ago
updated 1 year ago
6k stars

Top 0.2% on sourcepulse