apex by NVIDIA

PyTorch extension for streamlined mixed precision & distributed training

created 7 years ago
8,749 stars

Top 5.9% on sourcepulse

Project Summary

NVIDIA Apex provides PyTorch extensions for streamlined mixed-precision and distributed training, targeting researchers and engineers seeking to accelerate deep learning workflows. It offers utilities for automatic mixed precision (AMP) and optimized distributed data parallelism, aiming to simplify complex training configurations and improve performance.

How It Works

Apex historically offered apex.amp for automatic mixed precision, simplifying the integration of FP16 training by modifying only a few lines of code. It also provided apex.parallel.DistributedDataParallel for efficient multi-process distributed training, optimized with NVIDIA's NCCL communication library. While these specific modules are now deprecated in favor of upstream PyTorch equivalents, Apex continues to offer specialized fused kernels and optimized implementations for specific operations, particularly within its apex.contrib modules.
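
For historical context, the (now deprecated) apex.amp workflow looked roughly like the following; a minimal sketch assuming a standard CUDA model, optimizer, and loss:

    import torch
    from apex import amp

    model = torch.nn.Linear(10, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # amp.initialize patches the model and optimizer for mixed precision;
    # opt_level="O1" casts whitelisted ops to FP16 automatically.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    x = torch.randn(4, 10, device="cuda")
    target = torch.randn(4, 10, device="cuda")
    loss = torch.nn.functional.mse_loss(model(x), target)

    # scale_loss applies dynamic loss scaling before the backward pass
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

This is the "few lines of code" the deprecated API required on top of an ordinary FP32 training loop.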

Quick Start & Requirements

  • Installation: Recommended installation from source with CUDA and C++ extensions:
    git clone https://github.com/NVIDIA/apex
    cd apex
    # For pip >= 23.1
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
    # Or for older pip
    # pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
    
    A Python-only build is available via pip install -v --no-cache-dir ./.
  • Prerequisites: CUDA Toolkit, C++ compiler (GCC/Clang), Ninja (recommended for faster compilation). Specific apex.contrib modules may require newer CUDA versions (e.g., CUDA >= 11 for fused_weight_gradient_mlp_cuda, cuDNN >= 8.5 for cudnn_gbn_lib).
  • Resources: Compilation can be resource-intensive; passing the --parallel build option or NVCC's --threads flag can speed it up.
  • Documentation: https://nvidia.github.io/apex

Highlighted Details

  • Provides fused kernels for performance-critical operations (e.g., LayerNorm, Adam, multi-head attention); see the sketch after this list.
  • Includes specialized implementations for distributed training optimizations beyond standard PyTorch.
  • Offers utilities for synchronized batch normalization and custom loss functions within apex.contrib.
  • Supports installation via NVIDIA NGC PyTorch containers.
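
As an illustration of the fused kernels mentioned above, the following minimal sketch swaps in FusedLayerNorm and FusedAdam; it assumes Apex was built with the --cuda_ext option so the fused CUDA kernels are available:

    import torch
    from apex.normalization import FusedLayerNorm
    from apex.optimizers import FusedAdam

    # FusedLayerNorm is a drop-in replacement for torch.nn.LayerNorm
    # backed by a fused CUDA kernel
    model = torch.nn.Sequential(
        torch.nn.Linear(512, 512),
        FusedLayerNorm(512),
    ).cuda()

    # FusedAdam performs the Adam update with fused kernels per step
    optimizer = FusedAdam(model.parameters(), lr=1e-4)

    x = torch.randn(8, 512, device="cuda")
    loss = model(x).square().mean()
    loss.backward()
    optimizer.step()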

Maintenance & Community

  • Maintained by NVIDIA.
  • Some components are deprecated in favor of upstream PyTorch.
  • Check the repository for community contributions and potential forks.

Licensing & Compatibility

  • License: BSD-3-Clause.
  • Compatible with commercial and closed-source applications.

Limitations & Caveats

  • Core AMP and DistributedDataParallel utilities are deprecated; users are advised to migrate to PyTorch's native implementations (see the sketch after this list).
  • apex.contrib modules may not support all PyTorch releases and might have specific dependency requirements (e.g., newer CUDA/cuDNN versions).
  • Windows support is experimental and may require building PyTorch from source.
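
For reference, the migration target for apex.amp is PyTorch's built-in AMP (and torch.nn.parallel.DistributedDataParallel replaces apex.parallel.DistributedDataParallel); a minimal sketch of the native equivalent of the deprecated loop above:

    import torch

    model = torch.nn.Linear(10, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling, as apex.amp provided

    x = torch.randn(4, 10, device="cuda")
    target = torch.randn(4, 10, device="cuda")

    with torch.cuda.amp.autocast():  # eligible ops run in reduced precision
        loss = torch.nn.functional.mse_loss(model(x), target)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()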

Health Check

  • Last commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 7
  • Issues (30d): 3
  • Star History: 131 stars in the last 90 days

Explore Similar Projects

nunchaku by nunchaku-tech
  • High-performance 4-bit diffusion model inference engine
  • Top 2.1% on sourcepulse; 3k stars
  • Created 8 months ago; updated 14 hours ago
  • Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

FasterTransformer by NVIDIA
  • Optimized transformer library for inference
  • Top 0.2% on sourcepulse; 6k stars
  • Created 4 years ago; updated 1 year ago
  • Starred by Nat Friedman (Former CEO of GitHub), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 6 more.

DeepSpeed by deepspeedai
  • Deep learning optimization library for distributed training and inference
  • Top 0.2% on sourcepulse; 40k stars
  • Created 5 years ago; updated 1 day ago
  • Starred by Aravind Srinivas (Cofounder of Perplexity), Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake), and 12 more.