fastertransformer_backend by triton-inference-server

Triton backend for optimized transformer inference

Created 4 years ago
412 stars

Top 71.0% on SourcePulse

1 Expert Loves This Project
Project Summary

This repository provides a Triton Inference Server backend for NVIDIA's FasterTransformer library, enabling highly optimized inference for transformer models such as GPT-3, T5, and BERT. It targets researchers and engineers who need to deploy these models efficiently across multi-GPU and multi-node setups, offering significant performance gains through optimized attention kernels and parallelization strategies.

How It Works

The backend integrates FasterTransformer's optimized CUDA kernels for transformer layers into Triton. For auto-regressive models, it manages the key-value cache inside the backend across decoding steps, so attention states for previously generated tokens are not recomputed at every step. For multi-GPU/multi-node deployment, it leverages MPI for inter-node communication and multi-threading for intra-node GPU control, supporting both tensor and pipeline parallelism.
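
As a concrete illustration, a minimal Python client for a deployed model might look like the sketch below. It assumes Triton's HTTP endpoint on localhost:8000 and a model named "fastertransformer" exposing the GPT-style tensors (input_ids, input_lengths, request_output_len, output_ids) used in the repository's example configs; actual tensor names, dtypes, and shapes depend on the model's config.pbtxt.

    # Minimal sketch: query a FasterTransformer GPT model served by Triton.
    # Requires: pip install tritonclient[http] numpy
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # One already-tokenized prompt; the GPT example configs use uint32 ids.
    input_ids = np.array([[9915, 27221, 59, 77, 383, 1853]], dtype=np.uint32)
    input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)
    request_output_len = np.array([[32]], dtype=np.uint32)  # tokens to generate

    inputs = []
    for name, data in [("input_ids", input_ids),
                       ("input_lengths", input_lengths),
                       ("request_output_len", request_output_len)]:
        tensor = httpclient.InferInput(name, list(data.shape), "UINT32")
        tensor.set_data_from_numpy(data)
        inputs.append(tensor)

    result = client.infer("fastertransformer", inputs)
    print(result.as_numpy("output_ids"))  # generated token ids; detokenize separately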

Quick Start & Requirements

  • Install/Run: Requires building a custom Triton Docker image with FasterTransformer integrated; the repo provides the docker/create_dockerfile_and_build.py helper script as well as manual docker build commands (see the build sketch after this list).
  • Prerequisites: NVIDIA GPUs, CUDA, Docker. Specific Triton and FasterTransformer versions are tied to the CONTAINER_VERSION (e.g., 23.04).
  • Setup: Building the Docker image is the primary setup step.
  • Docs: Triton Backend Repo
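
As a rough sketch of that build step, driven from Python so it can be scripted alongside other setup: the TRITON_VERSION build-arg name and the triton_with_ft image tag follow the repository's documented manual build and should be treated as assumptions to verify against the README for your CONTAINER_VERSION.

    # Sketch: build the custom Triton + FasterTransformer image.
    # Assumes Docker is installed and the command runs from the repo root.
    import subprocess

    container_version = "23.04"  # pins the Triton/FasterTransformer pairing
    image_tag = f"triton_with_ft:{container_version}"  # assumed tag convention

    subprocess.run(
        [
            "docker", "build", "--rm",
            "--build-arg", f"TRITON_VERSION={container_version}",
            "-t", image_tag,
            "-f", "docker/Dockerfile",
            ".",
        ],
        check=True,  # raise CalledProcessError if the build fails
    )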

Highlighted Details

  • Supports FP16 and BF16 precision for various models including GPT, T5, BERT, BLOOM, GPT-J, and GPT-NeoX.
  • Enables tensor and pipeline parallelism for scaling inference across multiple GPUs and nodes (see the config sketch after this list).
  • Allows multiple model instances on the same GPUs, sharing weights to optimize memory usage.
  • Offers flexibility in configuring GPU topology and NCCL communication modes (GROUP vs. PARALLEL).
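
To make the parallelism knobs concrete, the sketch below generates the config.pbtxt fragment that controls them. The tensor_para_size and pipeline_para_size parameter names follow the repository's GPT example config; the constraint that their product matches the number of GPUs the model spans is an assumption based on that example.

    # Sketch: emit the config.pbtxt parameters that set FasterTransformer's
    # tensor and pipeline parallelism (values here are illustrative).
    TEMPLATE = """\
    parameters {{
      key: "tensor_para_size"
      value: {{ string_value: "{tp}" }}
    }}
    parameters {{
      key: "pipeline_para_size"
      value: {{ string_value: "{pp}" }}
    }}
    """

    def parallelism_fragment(num_gpus: int, tp: int) -> str:
        # Assumption: tensor_para_size * pipeline_para_size == num_gpus.
        assert num_gpus % tp == 0, "tensor parallel size must divide GPU count"
        return TEMPLATE.format(tp=tp, pp=num_gpus // tp)

    print(parallelism_fragment(num_gpus=8, tp=4))  # 4-way TP x 2-way PP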

Maintenance & Community

Development has transitioned to TensorRT-LLM. This repository is noted as a research and prototyping tool, not a formally maintained product. Questions and issues should be reported on the issues page.

Licensing & Compatibility

The repository itself appears to be Apache 2.0 licensed, but it integrates FasterTransformer, which is typically subject to NVIDIA's licensing terms. Compatibility for commercial use depends on the underlying FasterTransformer license.

Limitations & Caveats

The project is in a research/prototyping phase and is no longer under active development, with all efforts redirected to TensorRT-LLM. It explicitly states it is not a formal product or maintained framework.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0
Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Tri Dao (Chief Scientist at Together AI), Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), and 1 more.

oslo by tunib-ai

309 stars
Framework for large-scale transformer optimization
Created 3 years ago
Updated 3 years ago
Starred by Luca Soldaini (Research Scientist at Ai2), Edward Sun (Research Scientist at Meta Superintelligence Lab), and 4 more.

parallelformers by tunib-ai

790 stars
Toolkit for easy model parallelization
Created 4 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Luis Capelo (Cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

4k stars
AI inference pipeline framework
Created 1 year ago
Updated 1 day ago
Starred by Luis Capelo (Cofounder of Lightning AI), Alex Yu (Research Scientist at OpenAI; Former Cofounder of Luma AI), and 7 more.

TransformerEngine by NVIDIA

3k stars
Library for Transformer model acceleration on NVIDIA GPUs
Created 3 years ago
Updated 19 hours ago
Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

FasterTransformer by NVIDIA

6k stars
Optimized transformer library for inference
Created 4 years ago
Updated 1 year ago