AITemplate by facebookincubator

Generate high-performance inference engines

Created 3 years ago · 4,685 stars

Top 10.5% on SourcePulse

Project Summary

AITemplate is a Python framework designed to compile deep neural networks into highly optimized CUDA (NVIDIA) and HIP (AMD) C++ code for accelerated inference. It targets developers seeking near-roofline FP16 performance on NVIDIA TensorCore and AMD MatrixCore architectures, offering a unified, flexible, and open-source solution for deploying models like ResNet, BERT, and Stable Diffusion efficiently across different GPU platforms.

How It Works

AITemplate generates self-contained, portable C++ binaries for inference, eliminating dependencies on external runtimes such as TensorRT or cuDNN. Its performance advantage stems from advanced kernel fusion: horizontal fusion combines parallel operators, even with varying input shapes; vertical fusion folds elementwise operations, reductions, and layout permutations into TensorCore/MatrixCore operations; and memory fusion merges operators with subsequent memory manipulations such as concatenation or slicing. Fewer, larger kernels mean less launch overhead and less intermediate memory traffic, improving GPU utilization while broadening operator coverage.
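To make vertical fusion concrete, here is a minimal sketch using AITemplate's fused-GEMM ops, where the bias-add and ReLU epilogue execute inside the same TensorCore kernel as the matrix multiply. The op name gemm_rcr_bias_relu follows AITemplate's fused-op naming convention; the shapes and tensor names here are illustrative, not from the project's docs.

```python
from aitemplate.compiler import ops
from aitemplate.frontend import Tensor

# fp16 inputs: X is [M, K], W is [N, K] (the "rcr" layout), B is [N]
X = Tensor(shape=[256, 512], dtype="float16", name="X", is_input=True)
W = Tensor(shape=[1024, 512], dtype="float16", name="W", is_input=True)
B = Tensor(shape=[1024], dtype="float16", name="B", is_input=True)

# relu(X @ W^T + B): GEMM, bias-add, and ReLU emitted as one fused
# CUTLASS (CUDA) or Composable Kernel (ROCm) kernel, not three launches
Y = ops.gemm_rcr_bias_relu()(X, W, B)
Y._attrs["name"] = "Y"
Y._attrs["is_output"] = True
```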

Quick Start & Requirements

Installation involves cloning the repository with submodules (git clone --recursive https://github.com/facebookincubator/AITemplate). Building the Python wheel requires a matching GPU toolchain; CUDA 11.6 and ROCm 5.2.3 are the tested versions, and the provided Docker images are the recommended way to pin compiler environments. Hardware support targets NVIDIA SM80+ GPUs (Ampere and newer) and AMD CDNA2 GPUs (MI-210/MI-250); older architectures may encounter compatibility issues. Official documentation and onboarding tutorials are available.
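A minimal end-to-end sketch following the usage pattern in AITemplate's documentation: define a model with the PyTorch-like frontend, mark the output, and compile it into a standalone engine. The SimpleNet model, shapes, and names are illustrative.

```python
from aitemplate.compiler import compile_model
from aitemplate.frontend import nn, Tensor
from aitemplate.testing import detect_target

# Toy two-layer MLP written against AITemplate's PyTorch-like frontend
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 1024)
        self.fc2 = nn.Linear(1024, 256)

    def forward(self, x):
        return self.fc2(self.fc1(x))

model = SimpleNet()
x = Tensor(shape=[32, 512], dtype="float16", name="x", is_input=True)
y = model(x)
y._attrs["name"] = "y"
y._attrs["is_output"] = True

# detect_target() selects CUDA or ROCm; compile_model codegens and builds
# the self-contained C++ engine under ./tmp/simple_net
module = compile_model(y, detect_target(), "./tmp", "simple_net")
```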

Highlighted Details

  • Achieves near-roofline FP16 performance on major models using TensorCore/MatrixCore.
  • Implements advanced horizontal, vertical, and memory fusion for enhanced operator integration.
  • Provides a Python runtime that integrates with PyTorch tensors without extra data copies (see the sketch after this list).
  • The FX2AIT tool facilitates conversion of PyTorch models, offering partial acceleration for unsupported operators via its AITLowerer.
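
The zero-copy PyTorch interop looks like this in practice; a sketch assuming the module compiled in the Quick Start example above, with run_with_tensors taking name-keyed CUDA tensors that the engine reads and writes in place.

```python
import torch

# Reuses `module` from the compile_model() sketch above; the keys match
# the graph's input/output tensor names. No host/device copies are added:
# the engine operates directly on the PyTorch-owned GPU buffers.
x_pt = torch.randn(32, 512, dtype=torch.float16, device="cuda")
y_pt = torch.empty(32, 256, dtype=torch.float16, device="cuda")
module.run_with_tensors({"x": x_pt}, {"y": y_pt})
print(y_pt)  # outputs were written in place by the AITemplate engine
```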

Maintenance & Community

AITemplate is actively maintained by Meta engineers, with significant contributions from a broader team. The project collaborates closely with NVIDIA's CUTLASS and AMD's Composable Kernel teams to co-design GPU optimizations.

Licensing & Compatibility

AITemplate is released under the permissive Apache 2.0 License, allowing for broad compatibility with commercial and closed-source applications.

Limitations & Caveats

The framework is primarily tested on specific, modern GPU architectures (NVIDIA SM80+, AMD CDNA2), and performance or compatibility may be reduced on older hardware. Correct compiler versions are crucial for achieving optimal performance. While FX2AIT extends support, not all PyTorch operators are natively integrated into AITemplate.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull requests (30d): 5
  • Issues (30d): 1
  • Star history: 13 stars in the last 30 days

Explore Similar Projects

Starred by Shengjia Zhao (Chief Scientist at Meta Superintelligence Lab), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 14 more.

BIG-bench by google

Collaborative benchmark for probing and extrapolating LLM capabilities
Top 0.1% · 3k stars · Created 4 years ago · Updated 1 year ago
Starred by Aravind Srinivas (Cofounder of Perplexity), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 16 more.

text-to-text-transfer-transformer by google-research

Unified text-to-text transformer for NLP research
Top 0.1% · 6k stars · Created 6 years ago · Updated 5 months ago