AITemplate by facebookincubator

Generate high-performance inference engines

Created 3 years ago · 4,685 stars

Top 10.5% on SourcePulse

Project Summary

AITemplate is a Python framework designed to compile deep neural networks into highly optimized CUDA (NVIDIA) and HIP (AMD) C++ code for accelerated inference. It targets developers seeking near-roofline FP16 performance on NVIDIA TensorCore and AMD MatrixCore architectures, offering a unified, flexible, and open-source solution for deploying models like ResNet, BERT, and Stable Diffusion efficiently across different GPU platforms.

How It Works

AITemplate generates self-contained, portable C++ binaries for inference, eliminating dependencies on external runtimes such as TensorRT or cuDNN. Its performance advantage stems from advanced kernel fusion: horizontal fusion combines parallel operators, even with varying input shapes; vertical fusion folds elementwise operations, reductions, and layout permutations into TensorCore/MatrixCore operations; and memory fusion merges operators with subsequent memory manipulations such as concatenation or slicing. Fewer, larger kernels mean less launch overhead and less intermediate memory traffic, improving GPU utilization while broadening operator coverage.
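To make vertical fusion concrete, here is a minimal sketch using AITemplate's fused-GEMM ops, where the bias-add and ReLU epilogue execute inside the same TensorCore kernel as the matrix multiply. The op name gemm_rcr_bias_relu follows AITemplate's fused-op naming convention; the shapes and tensor names here are illustrative, not from the project's docs.

```python
from aitemplate.compiler import ops
from aitemplate.frontend import Tensor

# fp16 inputs: X is [M, K], W is [N, K] (the "rcr" layout), B is [N]
X = Tensor(shape=[256, 512], dtype="float16", name="X", is_input=True)
W = Tensor(shape=[1024, 512], dtype="float16", name="W", is_input=True)
B = Tensor(shape=[1024], dtype="float16", name="B", is_input=True)

# relu(X @ W^T + B): GEMM, bias-add, and ReLU emitted as one fused
# CUTLASS (CUDA) or Composable Kernel (ROCm) kernel, not three launches
Y = ops.gemm_rcr_bias_relu()(X, W, B)
Y._attrs["name"] = "Y"
Y._attrs["is_output"] = True
```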

Quick Start & Requirements

Installation involves cloning the repository with submodules (git clone --recursive https://github.com/facebookincubator/AITemplate). Building the Python wheel requires a matching GPU toolchain; CUDA 11.6 and ROCm 5.2.3 are the tested versions, and the provided Docker images are the recommended way to pin compiler environments. Hardware support targets NVIDIA SM80+ GPUs (Ampere and newer) and AMD CDNA2 GPUs (MI-210/MI-250); older architectures may encounter compatibility issues. Official documentation and onboarding tutorials are available.
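A minimal end-to-end sketch following the usage pattern in AITemplate's documentation: define a model with the PyTorch-like frontend, mark the output, and compile it into a standalone engine. The SimpleNet model, shapes, and names are illustrative.

```python
from aitemplate.compiler import compile_model
from aitemplate.frontend import nn, Tensor
from aitemplate.testing import detect_target

# Toy two-layer MLP written against AITemplate's PyTorch-like frontend
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 1024)
        self.fc2 = nn.Linear(1024, 256)

    def forward(self, x):
        return self.fc2(self.fc1(x))

model = SimpleNet()
x = Tensor(shape=[32, 512], dtype="float16", name="x", is_input=True)
y = model(x)
y._attrs["name"] = "y"
y._attrs["is_output"] = True

# detect_target() selects CUDA or ROCm; compile_model codegens and builds
# the self-contained C++ engine under ./tmp/simple_net
module = compile_model(y, detect_target(), "./tmp", "simple_net")
```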

Highlighted Details

  • Achieves near-roofline FP16 performance on major models using TensorCore/MatrixCore.
  • Implements advanced horizontal, vertical, and memory fusion for enhanced operator integration.
  • Provides a Python runtime that integrates with PyTorch tensors without extra data copies (see the sketch after this list).
  • The FX2AIT tool facilitates conversion of PyTorch models, offering partial acceleration for unsupported operators via its AITLowerer.
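
The zero-copy PyTorch interop looks like this in practice; a sketch assuming the module compiled in the Quick Start example above, with run_with_tensors taking name-keyed CUDA tensors that the engine reads and writes in place.

```python
import torch

# Reuses `module` from the compile_model() sketch above; the keys match
# the graph's input/output tensor names. No host/device copies are added:
# the engine operates directly on the PyTorch-owned GPU buffers.
x_pt = torch.randn(32, 512, dtype=torch.float16, device="cuda")
y_pt = torch.empty(32, 256, dtype=torch.float16, device="cuda")
module.run_with_tensors({"x": x_pt}, {"y": y_pt})
print(y_pt)  # outputs were written in place by the AITemplate engine
```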

Maintenance & Community

AITemplate is actively maintained by Meta engineers, with significant contributions from a broader team. The project collaborates closely with NVIDIA's CUTLASS and AMD's Composable Kernel teams to co-design GPU optimizations.

Licensing & Compatibility

AITemplate is released under the permissive Apache 2.0 License, allowing for broad compatibility with commercial and closed-source applications.

Limitations & Caveats

The framework is primarily tested on specific, modern GPU architectures (NVIDIA SM80+, AMD CDNA2), and performance or compatibility may be reduced on older hardware. Correct compiler versions are crucial for achieving optimal performance. While FX2AIT extends support, not all PyTorch operators are natively integrated into AITemplate.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 5
  • Issues (30d): 1
  • Star history: 13 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Collaborative benchmark for probing and extrapolating LLM capabilities
Top 0.1% · 3k stars · Created 4 years ago · Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Unified text-to-text transformer for NLP research
Top 0.1% · 6k stars · Created 6 years ago · Updated 5 months ago