FasterTransformer by NVIDIA

Optimized transformer library for inference

created 4 years ago
6,262 stars

Top 8.4% on sourcepulse

View on GitHub
Project Summary

NVIDIA FasterTransformer provides highly optimized Transformer-based encoder and decoder components for NLP inference. It targets researchers and engineers seeking to maximize inference performance on NVIDIA GPUs, offering significant speedups over standard framework implementations.

How It Works

FasterTransformer leverages CUDA, cuBLAS, and cuBLASLt for low-level optimizations. It implements fused kernels for common operations like multi-head attention and feed-forward networks, automatically utilizing Tensor Cores for FP16 precision on supported GPUs. The library is designed for seamless integration via custom TensorFlow and PyTorch operations, as well as a Triton backend.
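
As a concrete illustration of the PyTorch integration path, the sketch below loads a compiled FasterTransformer extension so its fused ops register with the runtime. The shared-library path is an assumption that depends on your build layout, and the exposed op classes vary by release; the repository's PyTorch examples show the exact class names for each model.

    # Minimal sketch: registering FasterTransformer's custom PyTorch ops after
    # a source build. The library path is an assumption; adjust to your build.
    import torch

    # load_library registers the extension's TorchScript classes/ops with the
    # PyTorch runtime.
    torch.classes.load_library("build/lib/libth_transformer.so")  # path: assumption

    # The fused encoder/decoder op classes are then constructible from Python;
    # see examples/pytorch in the repository for the classes your version exposes.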

Quick Start & Requirements

  • Install: Build from source. Instructions are provided in the README.
  • Prerequisites: CUDA Toolkit, cuBLAS, cuDNN, C++ compiler (GCC/G++ 4.8 recommended for TensorFlow ops), Python, TensorFlow or PyTorch. Tensor Cores require Volta, Turing, or Ampere GPUs; FP8 support requires Hopper (a quick capability check is sketched after this list).
  • Setup Time: Building from source can take 30-60 minutes depending on system configuration.
  • Links: Documentation, Support Matrix
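
A quick way to confirm the GPU requirements above is to query the device's compute capability. This check is ordinary PyTorch, not part of FasterTransformer itself:

    # Quick check that the local GPU meets the Tensor Core / FP8 requirements.
    import torch

    assert torch.cuda.is_available(), "No CUDA device visible"
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")

    # Volta is sm_70, Turing sm_75, Ampere sm_8x, Hopper sm_90.
    assert (major, minor) >= (7, 0), "Tensor Cores require Volta or newer"
    if major >= 9:
        print("Hopper detected: the experimental FP8 paths are usable")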

Highlighted Details

  • Supports FP16, INT8, and sparsity (Ampere and later); FP8 support is experimental.
  • Offers Tensor and Pipeline parallelism for models like BERT, GPT, T5, and BLOOM.
  • Reports speedups of roughly 5x to 18x over native TensorFlow and PyTorch implementations, depending on the model and task.
  • Includes optimized kernels for decoding strategies such as beam search and sampling (see the reference sketch after this list).
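
To make the decoding bullet concrete, here is reference logic for a single top-k sampling step in plain PyTorch. This is an illustrative sketch, not FasterTransformer's API; the library performs the equivalent selection inside fused CUDA kernels.

    # Reference top-k sampling step (illustrative; FasterTransformer fuses the
    # equivalent logic into optimized CUDA kernels).
    import torch

    def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
        # Scale logits, keep the k most likely tokens, renormalize, and sample.
        scores, ids = torch.topk(logits / temperature, k, dim=-1)  # (batch, k)
        probs = torch.softmax(scores, dim=-1)                      # renormalized over top-k
        choice = torch.multinomial(probs, num_samples=1)           # (batch, 1)
        return ids.gather(-1, choice).squeeze(-1)                  # (batch,)

    # Example: one decoding step for a batch of 2 with a 32,000-token vocabulary.
    next_tokens = top_k_sample(torch.randn(2, 32000))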

Maintenance & Community

Development has transitioned to TensorRT-LLM. FasterTransformer will remain available but will not receive further updates.

Licensing & Compatibility

Apache 2.0 License. Compatible with commercial and closed-source applications.

Limitations & Caveats

  • Development has ceased in favor of TensorRT-LLM.
  • Compilation issues may arise with newer TensorFlow versions (e.g., 2.10) due to undefined symbols.
  • Results from the TensorFlow and PyTorch ops may differ slightly because cumulative log probabilities are accumulated differently during decoding.
  • Specific GCC versions might be required for TensorFlow op builds.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 131 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

Explore Similar Projects

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine

Top 2.1% · 3k stars · created 8 months ago · updated 11 hours ago