aiter by ROCm

High-performance AI operator library for ROCm

Created 1 year ago
418 stars

Top 70.0% on SourcePulse

Project Summary

AI Tensor Engine for ROCm (AITER) is AMD's high-performance operator library, delivering optimized GPU kernels for AI inference and training on ROCm-enabled hardware. It provides framework developers with a unified, production-ready collection of operators, significantly boosting performance for AI workloads on AMD Instinct GPUs.

How It Works

AITER employs multiple kernel backends, including Triton, Composable Kernel (CK), and hand-tuned Assembly, to achieve peak performance. It offers both C++ and Python APIs and supports a broad range of AI tasks from inference to training, featuring fused GEMM and communication kernels. Its framework-agnostic design allows seamless integration into popular AI serving stacks like vLLM and SGLang, as well as custom solutions.

Quick Start & Requirements

Clone the repository recursively: git clone --recursive https://github.com/ROCm/aiter.git. Navigate into the directory and install using python3 setup.py develop. Key dependencies include a ROCm environment and compatible AMD GPUs (MI300X, MI325X, MI350, MI355X). Optional dependencies for advanced features like FlyDSL or Triton-based communication can be installed via requirements.txt. Official documentation is available at https://rocm.github.io/aiter.
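The install steps above, collected as a shell session (commands as given in the summary; this is environment setup, so adapt paths and the ROCm prerequisite to your machine):

```shell
# Requires a working ROCm environment and a supported AMD Instinct GPU
# (MI300X, MI325X, MI350, MI355X).
git clone --recursive https://github.com/ROCm/aiter.git
cd aiter
python3 setup.py develop

# Optional extras (e.g., FlyDSL, Triton-based communication):
pip install -r requirements.txt
```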

Highlighted Details

  • Achieves significant speedups, with up to 17x for the MLA decode kernel and up to 14x for the MHA prefill kernel.
  • Offers up to 3x speedup for Block-scaled Fused MoE and up to 2x for Block-scaled GEMM.
  • Serves as the default kernel backend for LLM inference on AMD GPUs within major frameworks like vLLM and SGLang.
  • Fully supports AMD Instinct MI300X, MI325X (CDNA3), MI350, and MI355X (CDNA4) GPUs.

Maintenance & Community

The project shows active development with recent releases (e.g., v0.1.12.post1 in April 2026) and frequent news updates detailing new features and integrations. No specific community channels (like Discord or Slack) are listed in the provided text.

Licensing & Compatibility

The specific license type and any associated compatibility notes for commercial use or closed-source linking are not detailed in the provided README content.

Limitations & Caveats

The JAX integration is currently marked 'Experimental'. The library is designed and optimized specifically for AMD's ROCm ecosystem and hardware.

Health Check

  • Last Commit: 5 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 376
  • Issues (30d): 37
  • Star History: 26 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Chip Huyen (author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

Liger-Kernel by linkedin — Triton kernels for efficient LLM training. Top 0.2%; 6k stars; created 1 year ago; updated 20 hours ago.