TransformerEngine by NVIDIA

Library for Transformer model acceleration on NVIDIA GPUs

created 2 years ago · 2,591 stars · Top 18.5% on sourcepulse

View on GitHub
Project Summary

NVIDIA Transformer Engine is a library designed to accelerate Transformer model training and inference on NVIDIA GPUs by leveraging FP8 precision. It targets researchers and engineers working with large language models (LLMs) and other Transformer architectures, offering significant performance gains and reduced memory usage.

How It Works

Transformer Engine provides optimized building blocks and an automatic mixed-precision API that integrate into existing deep learning workflows. It manages FP8 scaling factors and other necessary state internally, so adopting FP8 requires minimal code changes; on supported NVIDIA hardware, FP8 delivers higher throughput than FP16 with no degradation in accuracy.
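
The snippet below is a minimal sketch of that API for PyTorch, adapted from the project's quickstart. It assumes a CUDA-capable, FP8-capable GPU; the layer sizes are arbitrary.

    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # te.Linear is a drop-in replacement for torch.nn.Linear.
    model = te.Linear(768, 3072, bias=True)
    inp = torch.randn(2048, 768, device="cuda")

    # Delayed-scaling FP8 recipe: the library tracks amax history and
    # scaling factors internally, as described above.
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

    # Wrap the forward pass in fp8_autocast; backward works as usual.
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(inp)
    out.sum().backward()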

Quick Start & Requirements

  • Installation: Recommended via Docker images from the NGC Catalog. Pip installation is also supported: pip install --no-build-isolation transformer_engine[pytorch,jax] (a quick post-install sanity check is sketched after this list).
  • Prerequisites:
    • Hardware: NVIDIA GPUs with compute capability 8.9+ (Ada, Hopper, Blackwell) for FP8 features; Ampere and later for FP16/BF16 optimizations.
    • OS: Linux (WSL2 has limited support).
    • Software: CUDA 12.1+ (12.8+ for Blackwell), cuDNN 9.3+, GCC 9+/Clang 10+, Python 3.12.
    • Source Build: CMake 3.18+, Ninja, Git 2.17+, pybind11 2.6.0+.
  • Resources: Docker images provide pre-configured environments. Pip installation requires a C++ compiler and CUDA toolkit. Compilation with FlashAttention can be resource-intensive.
  • Docs: User Guide, Examples, Quickstart Notebook.
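
After installing, a minimal sanity check (a sketch; it assumes a visible CUDA GPU and exercises a single TE module, no FP8 required):

    import torch
    import transformer_engine.pytorch as te

    # Confirms the compiled extension loads and a TE module runs on the GPU.
    layer = te.LayerNorm(1024)
    x = torch.randn(8, 1024, device="cuda")
    print(layer(x).shape)  # expected: torch.Size([8, 1024])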

Highlighted Details

  • Supports FP8 precision on NVIDIA Hopper, Ada, and Blackwell GPUs for enhanced performance and memory efficiency.
  • Provides optimized kernels and an automatic mixed-precision API for PyTorch and JAX.
  • Demonstrates FP8 convergence with no significant difference in training loss curves compared to BF16 across various LLM architectures.
  • Integrated with major frameworks such as DeepSpeed, Hugging Face Accelerate, and NVIDIA NeMo (an Accelerate example is sketched below).
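
As one concrete example of that integration, Hugging Face Accelerate can delegate FP8 mixed precision to Transformer Engine. A minimal sketch, assuming Accelerate's FP8RecipeKwargs handler with the "te" backend (the handler name has changed across Accelerate releases, so check your version's docs):

    from accelerate import Accelerator
    from accelerate.utils import FP8RecipeKwargs

    # Route FP8 mixed precision through the Transformer Engine backend.
    handlers = [FP8RecipeKwargs(backend="te")]
    accelerator = Accelerator(mixed_precision="fp8", kwargs_handlers=handlers)

    # Model, optimizer, and dataloaders are then prepared as usual, e.g.:
    # model, optimizer, loader = accelerator.prepare(model, optimizer, loader)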

Maintenance & Community

  • Actively maintained by NVIDIA.
  • Integrations with numerous popular LLM frameworks.
  • Contributing Guide.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • FP8 features require specific NVIDIA GPU architectures (Compute Capability 8.9+).
  • WSL2 support is limited.
  • FlashAttention compilation can be memory-intensive.
  • A breaking change in v1.7 altered padding mask definitions in the PyTorch implementation.

Health Check

  • Last commit: 20 hours ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 106
  • Issues (30d): 20

Star History

  • 215 stars in the last 90 days

Explore Similar Projects

InternEvo by InternLM
  • Lightweight training framework for model pre-training
  • 402 stars · 1.0% · created 1 year ago · updated 1 week ago
  • Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

nunchaku by nunchaku-tech
  • High-performance 4-bit diffusion model inference engine
  • 3k stars · 2.1% · created 8 months ago · updated 10 hours ago
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

FasterTransformer by NVIDIA
  • Optimized transformer library for inference
  • 6k stars · 0.2% · created 4 years ago · updated 1 year ago
  • Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

flash-attention by Dao-AILab
  • Fast, memory-efficient attention implementation
  • 19k stars · 0.7% · created 3 years ago · updated 14 hours ago
  • Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 16 more.