Optimized transformer library for inference
NVIDIA FasterTransformer provides highly optimized Transformer-based encoder and decoder components for NLP inference. It targets researchers and engineers seeking to maximize inference performance on NVIDIA GPUs, offering significant speedups over standard framework implementations.
How It Works
FasterTransformer leverages CUDA, cuBLAS, and cuBLASLt for low-level optimizations. It implements fused kernels for common operations such as multi-head attention and feed-forward networks, automatically using Tensor Cores for FP16 precision on supported GPUs. The library integrates into existing pipelines via custom TensorFlow and PyTorch operations, as well as a Triton Inference Server backend; the PyTorch path is sketched below.
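As a minimal sketch of the PyTorch integration path: the compiled extension is loaded as a TorchScript library, and FP16 CUDA tensors are passed to the fused ops. The library path and the class/constructor names in the comments are illustrative assumptions, not the exact API; consult the repository's examples/pytorch directory for the real signatures for your version.

```python
import torch

# Assumption: FasterTransformer was built with its PyTorch op support
# enabled, producing a TorchScript extension library. The path below is
# illustrative and depends on the build directory layout.
torch.classes.load_library("build/lib/libth_transformer.so")

# FP16 inputs on a CUDA device let the fused kernels dispatch to
# Tensor Cores on supported GPUs.
hidden_states = torch.randn(8, 128, 768, dtype=torch.float16, device="cuda")

# Hypothetical usage: the extension registers TorchScript classes that
# wrap the fused encoder/decoder kernels. Exact class names and
# constructor arguments vary by FasterTransformer version, so the lines
# below are placeholders rather than a working call.
# encoder = torch.classes.FasterTransformer.Encoder(...)
# output = encoder.forward(hidden_states, sequence_lengths)
```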
Quick Start & Requirements
Highlighted Details
Maintenance & Community
Development has transitioned to TensorRT-LLM. FasterTransformer will remain available but will not receive further updates.
Licensing & Compatibility
Apache 2.0 License. Compatible with commercial and closed-source applications.
Limitations & Caveats
The project is inactive: the repository was last updated roughly a year ago, and new feature work continues in TensorRT-LLM rather than here.