FasterTransformer by NVIDIA

Optimized transformer library for inference

created 4 years ago
6,262 stars

Top 8.4% on sourcepulse

View on GitHub
Project Summary

NVIDIA FasterTransformer provides highly optimized Transformer-based encoder and decoder components for NLP inference. It targets researchers and engineers seeking to maximize inference performance on NVIDIA GPUs, offering significant speedups over standard framework implementations.

How It Works

FasterTransformer leverages CUDA, cuBLAS, and cuBLASLt for low-level optimizations. It implements fused kernels for common operations like multi-head attention and feed-forward networks, automatically utilizing Tensor Cores for FP16 precision on supported GPUs. The library is designed for seamless integration via custom TensorFlow and PyTorch operations, as well as a Triton backend.
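
As a concrete illustration of the PyTorch integration path, the sketch below loads a compiled FasterTransformer extension so its fused ops register with the runtime. The shared-library path is an assumption that depends on your build layout, and the exposed op classes vary by release; the repository's PyTorch examples show the exact class names for each model.

    # Minimal sketch: registering FasterTransformer's custom PyTorch ops after
    # a source build. The library path is an assumption; adjust to your build.
    import torch

    # load_library registers the extension's TorchScript classes/ops with the
    # PyTorch runtime.
    torch.classes.load_library("build/lib/libth_transformer.so")  # path: assumption

    # The fused encoder/decoder op classes are then constructible from Python;
    # see examples/pytorch in the repository for the classes your version exposes.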

Quick Start & Requirements

  • Install: Build from source. Instructions are provided in the README.
  • Prerequisites: CUDA Toolkit, cuBLAS, cuDNN, C++ compiler (GCC/G++ 4.8 recommended for TensorFlow ops), Python, TensorFlow or PyTorch. Tensor Cores require Volta, Turing, or Ampere GPUs; FP8 support requires Hopper (a quick capability check is sketched after this list).
  • Setup Time: Building from source can take 30-60 minutes depending on system configuration.
  • Links: Documentation, Support Matrix
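
A quick way to confirm the GPU requirements above is to query the device's compute capability. This check is ordinary PyTorch, not part of FasterTransformer itself:

    # Quick check that the local GPU meets the Tensor Core / FP8 requirements.
    import torch

    assert torch.cuda.is_available(), "No CUDA device visible"
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")

    # Volta is sm_70, Turing sm_75, Ampere sm_8x, Hopper sm_90.
    assert (major, minor) >= (7, 0), "Tensor Cores require Volta or newer"
    if major >= 9:
        print("Hopper detected: the experimental FP8 paths are usable")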

Highlighted Details

  • Supports FP16, INT8, and sparsity (Ampere and later); FP8 support is experimental.
  • Offers Tensor and Pipeline parallelism for models like BERT, GPT, T5, and BLOOM.
  • Reports speedups of roughly 5x to 18x over native TensorFlow and PyTorch implementations, depending on the model and task.
  • Includes optimized kernels for decoding strategies such as beam search and sampling (see the reference sketch after this list).
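
To make the decoding bullet concrete, here is reference logic for a single top-k sampling step in plain PyTorch. This is an illustrative sketch, not FasterTransformer's API; the library performs the equivalent selection inside fused CUDA kernels.

    # Reference top-k sampling step (illustrative; FasterTransformer fuses the
    # equivalent logic into optimized CUDA kernels).
    import torch

    def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> torch.Tensor:
        # Scale logits, keep the k most likely tokens, renormalize, and sample.
        scores, ids = torch.topk(logits / temperature, k, dim=-1)  # (batch, k)
        probs = torch.softmax(scores, dim=-1)                      # renormalized over top-k
        choice = torch.multinomial(probs, num_samples=1)           # (batch, 1)
        return ids.gather(-1, choice).squeeze(-1)                  # (batch,)

    # Example: one decoding step for a batch of 2 with a 32,000-token vocabulary.
    next_tokens = top_k_sample(torch.randn(2, 32000))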

Maintenance & Community

Development has transitioned to TensorRT-LLM. FasterTransformer will remain available but will not receive further updates.

Licensing & Compatibility

Apache 2.0 License. Compatible with commercial and closed-source applications.

Limitations & Caveats

  • Development has ceased in favor of TensorRT-LLM.
  • Compilation issues may arise with newer TensorFlow versions (e.g., 2.10) due to undefined symbols.
  • Results from the TensorFlow and PyTorch ops may differ slightly because cumulative log probabilities are accumulated differently during decoding.
  • Specific GCC versions might be required for TensorFlow op builds.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 131 stars in the last 90 days

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

Explore Similar Projects

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine

Top 2.1% · 3k stars · created 8 months ago · updated 11 hours ago