triton-flash-attention  by hkproj

Triton implementation of Flash Attention 2

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

Summary

This repository implements the Flash Attention 2 algorithm in Triton. It targets deep learning practitioners who want to optimize the computationally intensive attention mechanism, and it addresses the memory and speed bottlenecks that attention incurs at large sequence lengths in transformer models. The primary benefit is faster training and inference with reduced memory overhead.

How It Works

The project uses Triton, a Python-based language and compiler for writing custom GPU kernels, to implement the Flash Attention 2 algorithm. It is based on OpenAI's Fused Attention implementation. The key design choice is to avoid materializing the full SEQ_LEN x SEQ_LEN attention matrix, which is a significant memory bottleneck in standard implementations because its size grows quadratically with sequence length. Attention is instead computed block by block, which lets the algorithm scale to much longer sequence lengths on the same hardware.
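The idea of never holding the full SEQ_LEN x SEQ_LEN matrix can be illustrated in plain Python. The sketch below is not the repository's Triton kernel; it is a minimal CPU illustration of the online-softmax technique that Flash Attention builds on, with hypothetical function names (naive_attention, tiled_attention) and a hypothetical block_size parameter. It compares a reference that builds every score against a version that walks over keys and values in blocks and keeps only a running maximum, a running sum, and a running output per query.

```python
import math


def naive_attention(q, k, v):
    """Reference: builds the full SEQ_LEN x SEQ_LEN score matrix first."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    out = []
    for row in scores:
        row_max = max(row)
        weights = [math.exp(s - row_max) for s in row]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Processes keys/values block by block, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        m = float("-inf")          # running maximum of the scores seen so far
        l = 0.0                    # running sum of exp(score - m)
        acc = [0.0] * len(v[0])    # running unnormalized output
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            s = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            m_new = max(m, max(s))
            # Rescale earlier partial results to the new maximum.
            correction = math.exp(m - m_new)
            p = [math.exp(x - m_new) for x in s]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pj * vj[d] for pj, vj in zip(p, v_blk))
                   for d, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


# Small deterministic inputs: 5 positions, head dimension 4 (illustrative values).
q = [[math.sin(i + d) for d in range(4)] for i in range(5)]
k = [[math.cos(i * d) for d in range(4)] for i in range(5)]
v = [[float(i + d) for d in range(4)] for i in range(5)]

ref = naive_attention(q, k, v)
out = tiled_attention(q, k, v, block_size=2)
max_err = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
print("max abs difference:", max_err)
```

Both functions should agree up to floating-point rounding. The tiled version only ever holds block_size scores per query at a time, which is the property that a GPU kernel exploits to keep working data in fast on-chip memory.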

Quick Start & Requirements

To install, install the dependencies listed in triton/requirements.txt. Set BATCH_SIZE, NUM_HEADS, SEQ_LEN, and HEAD_DIM to values that fit your hardware, since values that are too large can exhaust its resources. The project includes CUDA examples demonstrating its usage.
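A quick back-of-the-envelope estimate can help when choosing these parameters. The sketch below is not part of the repository; the helper names and the example values are hypothetical, and it assumes 2 bytes per element (fp16). It compares the memory a naive implementation would need for the attention scores with the memory for the Q, K, V, and output tensors themselves.

```python
def naive_scores_bytes(batch_size, num_heads, seq_len, bytes_per_element=2):
    """Bytes for one full SEQ_LEN x SEQ_LEN score matrix per batch and head."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


def qkvo_bytes(batch_size, num_heads, seq_len, head_dim, bytes_per_element=2):
    """Bytes for the Q, K, V and output tensors (four tensors of equal shape)."""
    return 4 * batch_size * num_heads * seq_len * head_dim * bytes_per_element


# Hypothetical configuration, chosen only for illustration.
BATCH_SIZE, NUM_HEADS, SEQ_LEN, HEAD_DIM = 8, 16, 4096, 64

scores = naive_scores_bytes(BATCH_SIZE, NUM_HEADS, SEQ_LEN)
tensors = qkvo_bytes(BATCH_SIZE, NUM_HEADS, SEQ_LEN, HEAD_DIM)

print(f"naive score matrix: {scores / 1024**3:.2f} GiB")
print(f"Q, K, V, O tensors: {tensors / 1024**2:.2f} MiB")
```

With these example values the score matrix alone takes 4 GiB while the input and output tensors take 256 MiB. Doubling SEQ_LEN quadruples the first figure but only doubles the second.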

Highlighted Details

  • Implements the Flash Attention 2 algorithm, known for its efficiency gains.
  • Utilizes Triton for high-performance, custom GPU kernel development.
  • Includes CUDA examples for practical application.
  • Features open-ended exercises for further optimization, including autotuning the backward pass and enhancing causal attention efficiency by avoiding unnecessary computations.
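One way to read the causal attention exercise is as block skipping: under a causal mask, a key/value block that lies entirely after the last query of a query block contributes nothing and does not need to be visited. The sketch below is an assumption about what such an optimization could look like, not the repository's code; the function name and the block_q and block_kv parameters are hypothetical. It only counts how many blocks would be visited.

```python
def causal_block_counts(seq_len, block_q, block_kv):
    """Return (needed, total) key/value block visits under a causal mask."""
    needed = 0
    total = 0
    for q_start in range(0, seq_len, block_q):
        # Index of the last query position in this query block.
        q_last = min(q_start + block_q, seq_len) - 1
        for kv_start in range(0, seq_len, block_kv):
            total += 1
            # The block matters only if its first key is visible to the
            # last query of the block; otherwise it is fully masked.
            if kv_start <= q_last:
                needed += 1
    return needed, total


needed, total = causal_block_counts(seq_len=1024, block_q=128, block_kv=128)
print(f"{needed} of {total} blocks need computing")
```

For this example configuration, 36 of the 64 blocks are needed, and the fraction approaches one half as the number of blocks grows.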

Maintenance & Community

The provided README does not contain information regarding specific contributors, community channels (like Discord or Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The license under which this project is distributed is not specified in the README. Consequently, compatibility for commercial use or integration into closed-source projects cannot be determined without further clarification.

Limitations & Caveats

This implementation has not been tested on AMD hardware. Large sequence lengths can exhaust memory, particularly when the naive attention implementation, which materializes the full attention matrix, is not explicitly disabled. The project also presents open-ended exercises, which suggests it may be experimental or a platform for ongoing research and development.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive
Pull Requests (30d): 0
Issues (30d): 0

Star History

4 stars in the last 30 days
