cuLA by inclusionAI

CUDA kernels for efficient linear attention

Created 1 week ago


419 stars

Top 70.1% on SourcePulse

Project Summary

This project provides high-performance CUDA kernels for linear attention variants, addressing the quadratic complexity of standard attention mechanisms in Large Language Models (LLMs). It targets researchers and engineers developing long-context LLM applications, offering significant speedups on modern NVIDIA GPUs by enabling linear-time state updates.

How It Works

cuLA implements hand-tuned CUDA kernels for linear attention variants like GLA, KDA, GDN, and Lightning Attention using the CuTe DSL and CUTLASS C++. This approach replaces computationally expensive quadratic pairwise interactions with efficient linear-time updates, making it suitable for processing extensive contexts. The kernels are specifically optimized for NVIDIA Blackwell (SM10X) and Hopper (SM90) GPU architectures.
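The swap from quadratic pairwise interactions to a linear-time recurrent state can be seen in a minimal numpy sketch (illustrative only, not cuLA's API): for causal attention without softmax, carrying a running (d_k x d_v) state produces exactly the same output as materializing the full T x T score matrix.

```python
import numpy as np

# Minimal sketch (not cuLA's API): causal linear attention without softmax.
# The quadratic form materializes the full T x T interaction matrix; the
# linear form carries a running (d_k x d_v) state updated in O(1) per step.

def quadratic_causal(q, k, v):
    scores = q @ k.T                      # (T, T) pairwise interactions
    scores = np.tril(scores)              # causal mask
    return scores @ v                     # (T, d_v)

def linear_causal(q, k, v):
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))              # recurrent state
    out = np.empty((q.shape[0], d_v))
    for t in range(q.shape[0]):
        S = S + np.outer(k[t], v[t])      # linear-time state update
        out[t] = q[t] @ S
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
assert np.allclose(quadratic_causal(q, k, v), linear_causal(q, k, v))
```

Gated variants such as GLA add a per-step decay to the state update, but the linear-time structure is the same; cuLA's contribution is making these updates fast on Hopper and Blackwell hardware.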

Quick Start & Requirements

  • Installation:
      1. Clone the repository and initialize submodules: git submodule update --init --recursive
      2. Install PyTorch: pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu129
      3. Install flash-linear-attention: pip install -e third_party/flash-linear-attention
      4. Install cuLA: pip install -e . --no-build-isolation
  • Prerequisites: Python 3.12+, CUDA Toolkit 12.9+, NVCC 12.9+, PyTorch 2.9.1+ (matching CUDA version). Requires NVIDIA Hopper (SM90) or Blackwell (SM10X) GPUs.
  • Links: Repository: https://github.com/inclusionAI/cuLA. CuTe DSL: https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL.

Highlighted Details

  • Achieves significant speedups, e.g., KDA Modular Forward (Blackwell) averages 1.45x, Lightning Attention Prefill (Blackwell) up to 1.86x, and KDA Fused Forward (Hopper) averages 1.52x.
  • Designed as a drop-in submodule for flash-linear-attention (FLA), requiring only a one-line import change for integration.
  • Supports key linear attention variants including KDA (Kimi Delta Attention) and Lightning Attention.
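The delta-rule family that KDA and GDN belong to refines the purely additive state update: each step first erases what the state already associates with the incoming key, then writes the new value. A minimal numpy sketch, assuming the standard DeltaNet-style formulation (function and variable names here are illustrative, not cuLA's API):

```python
import numpy as np

# Hedged sketch (illustrative, not cuLA's API): the ungated delta-rule
# update underlying DeltaNet-style variants such as KDA and GDN. Rather
# than accumulating k v^T unconditionally, each step subtracts the state's
# current prediction for k before writing v, scaled by a learning rate beta.

def delta_rule_step(S, k, v, beta):
    # S: (d_k, d_v) associative state; its recall for key k is S.T @ k.
    pred = S.T @ k                        # what the state currently recalls
    return S + beta * np.outer(k, v - pred)

d_k = d_v = 4
S = np.zeros((d_k, d_v))
keys = np.eye(d_k)                        # orthonormal keys for illustration
vals = np.arange(d_k * d_v, dtype=float).reshape(d_k, d_v)
for i in range(d_k):
    S = delta_rule_step(S, keys[i], vals[i], beta=1.0)

# With beta = 1 and orthonormal keys, the state stores each value exactly.
assert np.allclose(S.T @ keys.T, vals.T)
```

Because the update erases before it writes, re-presenting an already-stored key/value pair leaves the state unchanged, which is what makes delta-rule variants better at overwriting stale associations than plain additive linear attention.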

Maintenance & Community

Contributions are actively welcomed for performance tuning, new algorithm support, and bug fixes. A Slack community is available for Q&A and discussion. The project roadmap indicates ongoing development towards full integration with FLA and further optimizations.

Licensing & Compatibility

The README does not explicitly state a software license, so suitability for commercial use or closed-source integration cannot be assessed until the authors clarify licensing.

Limitations & Caveats

cuLA is in its "Early Stage" of development, with many kernels requiring further optimization and the API subject to evolution. CUDA kernel tuning is noted as labor-intensive, highlighting a potential bottleneck for rapid development or broad community contribution.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 27
  • Issues (30d): 21
  • Star History: 420 stars in the last 10 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

0.5%
3k
Attention kernel for plug-and-play inference acceleration
Created 1 year ago
Updated 2 months ago
Starred by Mehdi Amini (Author of MLIR; Distinguished Engineer at NVIDIA), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

flashinfer by flashinfer-ai

1.6%
5k
Kernel library for LLM serving
Created 2 years ago
Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
23k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 3 days ago