Triton backend for optimized transformer inference
This repository provides a Triton Inference Server backend for NVIDIA's FasterTransformer library, enabling highly optimized inference for large language models like GPT-3, T5, and BERT. It targets researchers and engineers needing to deploy these models efficiently across multi-GPU and multi-node setups, offering significant performance gains through optimized attention mechanisms and parallelization strategies.
How It Works
The backend integrates FasterTransformer's optimized CUDA kernels for transformer layers into Triton. It addresses the challenges of auto-regressive generation by managing key-value caches internally, avoiding redundant computation across decoding steps. For multi-GPU/multi-node deployment, it uses MPI for inter-node communication and multi-threading for intra-node GPU control, supporting both tensor and pipeline parallelism.
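To make the server-side cache management concrete, the sketch below sends a single generation request from a Python client: the full prompt and the requested output length go in one request, and the backend runs the auto-regressive decode loop (including KV caching) internally. The model name fastertransformer and the tensor names input_ids, input_lengths, request_output_len, and output_ids are assumptions based on a typical GPT configuration and may differ for other models and configs.

```python
# Minimal client sketch: one request carries the whole prompt and the number
# of tokens to generate; the backend runs the auto-regressive loop and manages
# the KV cache server-side. Model and tensor names are assumptions based on a
# typical GPT config and may differ in your deployment.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)              # [batch, seq_len], placeholder token ids
input_lengths = np.array([[prompt_ids.shape[1]]], dtype=np.uint32)  # valid length per sequence
output_len = np.array([[32]], dtype=np.uint32)                      # tokens to generate

inputs = []
for name, data in (("input_ids", prompt_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", output_len)):
    tensor = httpclient.InferInput(name, list(data.shape), "UINT32")
    tensor.set_data_from_numpy(data)
    inputs.append(tensor)

result = client.infer(model_name="fastertransformer", inputs=inputs)
print(result.as_numpy("output_ids"))  # generated token ids
```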
Quick Start & Requirements
The backend is built and run from a Docker image. Either the docker/create_dockerfile_and_build.py script or the manual docker build commands provided in the repository can be used, with a CONTAINER_VERSION (e.g., 23.04) specified for the build.
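Once the image is built and the Triton server is launched inside the container, a quick readiness check confirms the deployment before sending inference requests. This is a minimal sketch assuming a local server on the default HTTP port and a model registered under the name fastertransformer; both are deployment-specific.

```python
# Minimal readiness check: confirm the Triton server inside the container is
# live and the FasterTransformer model is loaded before sending requests.
# The URL and the model name "fastertransformer" are deployment-specific assumptions.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
if client.is_server_live() and client.is_model_ready("fastertransformer"):
    print("Triton is live and the model is ready.")
else:
    print("Not ready yet; check the container logs and the model repository path.")
```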
Highlighted Details
Notable configuration details include a choice between GROUP and PARALLEL modes.
Maintenance & Community
Development has transitioned to TensorRT-LLM. This repository is noted as a research and prototyping tool, not a formally maintained product. Questions and issues should be reported on the issues page.
Licensing & Compatibility
The repository itself appears to be Apache 2.0 licensed, but it integrates FasterTransformer, which is typically subject to NVIDIA's licensing terms. Compatibility for commercial use depends on the underlying FasterTransformer license.
Limitations & Caveats
The project is in a research/prototyping phase and is no longer under active development, with all efforts redirected to TensorRT-LLM. It explicitly states it is not a formal product or maintained framework.