triton-flash-attention by hkproj

Triton implementation of Flash Attention 2

Created 1 year ago

256 stars

Top 98.5% on SourcePulse

Project Summary

Summary

This repository implements the Flash Attention 2 algorithm in Triton. It is aimed at deep learning practitioners who want to optimize attention, one of the most compute- and memory-intensive parts of a transformer model. It targets the memory and speed bottlenecks that appear at large sequence lengths, with the goal of faster training and inference at lower memory overhead.

How It Works

The project uses Triton, a Python-based language and compiler for writing custom GPU kernels, to implement the Flash Attention 2 algorithm. It is based on OpenAI's Fused Attention implementation. The key design choice is to never materialize the full SEQ_LEN x SEQ_LEN attention matrix, which is a significant memory bottleneck in standard implementations. Because the size of that matrix grows quadratically with sequence length, avoiding it lets the algorithm handle much longer sequences on the same hardware.
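To illustrate the idea of not materializing the full attention matrix, here is a minimal sketch in plain Python (not Triton, and not the repository's kernel). It processes keys and values in blocks and keeps a running maximum and running sum per query, the online-softmax trick that Flash Attention builds on. The function names `reference_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, chosen only for this example.

```python
import math


def reference_attention(q, k, v):
    """Standard softmax attention: scores each query against ALL keys at once."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, but only `block_size` scores exist at any time per query.

    Illustrative sketch only: the real kernel also tiles the queries and
    runs on the GPU.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    head_dim = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf   # largest score seen so far
        running_sum = 0.0         # softmax denominator, relative to running_max
        acc = [0.0] * head_dim    # unnormalized weighted sum of values
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[d] for w, vj in zip(weights, v_blk))
                   for d, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Both functions should agree up to floating-point rounding; the tiled version simply never holds more than one block of scores per query.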

Quick Start & Requirements

To install, install the dependencies listed in triton/requirements.txt. Users must choose BATCH_SIZE, NUM_HEADS, SEQ_LEN, and HEAD_DIM to fit their hardware, since values that are too large can exhaust GPU resources. The project includes CUDA examples demonstrating its usage.
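As a rough way to sanity-check a configuration before running it, the sketch below estimates how much memory a fully materialized attention matrix would need compared with the Q, K, V, and output tensors. The helper `attention_memory_bytes`, the 2-byte (fp16) element size, and the example values are assumptions for illustration, not settings taken from the repository.

```python
def attention_memory_bytes(batch_size, num_heads, seq_len, head_dim,
                           bytes_per_element=2):
    """Return (score_matrix_bytes, qkvo_bytes) for one attention call.

    Assumes fp16 (2 bytes per element) by default; hypothetical helper.
    """
    # Full SEQ_LEN x SEQ_LEN score matrix for every batch element and head.
    scores = batch_size * num_heads * seq_len * seq_len * bytes_per_element
    # Q, K, V and the output, each of shape (batch, heads, seq_len, head_dim).
    qkvo = 4 * batch_size * num_heads * seq_len * head_dim * bytes_per_element
    return scores, qkvo


# Example values, chosen only for illustration.
BATCH_SIZE, NUM_HEADS, SEQ_LEN, HEAD_DIM = 8, 16, 4096, 64
scores_bytes, qkvo_bytes = attention_memory_bytes(
    BATCH_SIZE, NUM_HEADS, SEQ_LEN, HEAD_DIM)

print(f"materialized scores: {scores_bytes / 2**30:.2f} GiB")  # 4.00 GiB
print(f"Q, K, V, O tensors:  {qkvo_bytes / 2**30:.2f} GiB")    # 0.25 GiB
```

With these example values the score matrix alone is 16 times larger than all four input and output tensors combined, and doubling SEQ_LEN quadruples it.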

Highlighted Details

Implements the Flash Attention 2 algorithm, known for its efficiency gains.
Utilizes Triton for high-performance, custom GPU kernel development.
Includes CUDA examples for practical application.
Features open-ended exercises for further optimization, including autotuning the backward pass and enhancing causal attention efficiency by avoiding unnecessary computations.
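One way to reason about the causal attention exercise is to count how many (query block, key block) tiles actually need work. Under a causal mask, a tile whose keys all come after its queries contributes nothing and can be skipped, a tile whose keys all come at or before its queries needs no mask, and only tiles that straddle the diagonal need element-wise masking. The sketch below is a hypothetical helper for that bookkeeping (`classify_causal_tiles` and its block-size parameters are not names from the repository).

```python
def classify_causal_tiles(seq_len, block_q, block_kv):
    """Count tiles by how a causal mask (key index <= query index) affects them."""
    counts = {"skipped": 0, "masked": 0, "full": 0}
    for q_start in range(0, seq_len, block_q):
        q_end = min(q_start + block_q, seq_len) - 1      # last query index in tile
        for kv_start in range(0, seq_len, block_kv):
            kv_end = min(kv_start + block_kv, seq_len) - 1  # last key index in tile
            if kv_start > q_end:
                # Every key is in the future of every query: no work needed.
                counts["skipped"] += 1
            elif kv_end <= q_start:
                # Every key is visible to every query: no mask needed.
                counts["full"] += 1
            else:
                # Tile crosses the diagonal: mask element by element.
                counts["masked"] += 1
    return counts


print(classify_causal_tiles(seq_len=1024, block_q=128, block_kv=128))
# {'skipped': 28, 'masked': 8, 'full': 28}
```

In this example 28 of the 64 tiles can be skipped outright, which is the kind of unnecessary computation the exercise refers to.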

Maintenance & Community

The provided README does not contain information regarding specific contributors, community channels (like Discord or Slack), sponsorships, or a public roadmap.

Licensing & Compatibility

The license under which this project is distributed is not specified in the README. Consequently, compatibility for commercial use or integration into closed-source projects cannot be determined without further clarification.

Limitations & Caveats

This implementation has not been tested on AMD hardware. Memory can become a constraint at large sequence lengths, particularly if the naive attention implementation, which materializes the full SEQ_LEN x SEQ_LEN matrix, is not explicitly disabled. The project also presents open-ended exercises, which suggests it may be experimental or intended as a platform for further research and development.
