cuLA by inclusionAI

CUDA kernels for efficient linear attention

Created 1 week ago


419 stars

Top 70.1% on SourcePulse

Project Summary

This project provides high-performance CUDA kernels for linear attention variants, addressing the quadratic complexity of standard attention mechanisms in Large Language Models (LLMs). It targets researchers and engineers developing long-context LLM applications, offering significant speedups on modern NVIDIA GPUs by enabling linear-time state updates.

How It Works

cuLA implements hand-tuned CUDA kernels for linear attention variants like GLA, KDA, GDN, and Lightning Attention using the CuTe DSL and CUTLASS C++. This approach replaces computationally expensive quadratic pairwise interactions with efficient linear-time updates, making it suitable for processing extensive contexts. The kernels are specifically optimized for NVIDIA Blackwell (SM10X) and Hopper (SM90) GPU architectures.
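The swap from quadratic pairwise interactions to a linear-time recurrent state can be seen in a minimal numpy sketch (illustrative only, not cuLA's API): for causal attention without softmax, carrying a running (d_k x d_v) state produces exactly the same output as materializing the full T x T score matrix.

```python
import numpy as np

# Minimal sketch (not cuLA's API): causal linear attention without softmax.
# The quadratic form materializes the full T x T interaction matrix; the
# linear form carries a running (d_k x d_v) state updated in O(1) per step.

def quadratic_causal(q, k, v):
    scores = q @ k.T                      # (T, T) pairwise interactions
    scores = np.tril(scores)              # causal mask
    return scores @ v                     # (T, d_v)

def linear_causal(q, k, v):
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))              # recurrent state
    out = np.empty((q.shape[0], d_v))
    for t in range(q.shape[0]):
        S = S + np.outer(k[t], v[t])      # linear-time state update
        out[t] = q[t] @ S
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
assert np.allclose(quadratic_causal(q, k, v), linear_causal(q, k, v))
```

Gated variants such as GLA add a per-step decay to the state update, but the linear-time structure is the same; cuLA's contribution is making these updates fast on Hopper and Blackwell hardware.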

Quick Start & Requirements

  • Installation:
      1. Clone the repository and initialize submodules: git submodule update --init --recursive
      2. Install PyTorch: pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu129
      3. Install flash-linear-attention: pip install -e third_party/flash-linear-attention
      4. Install cuLA: pip install -e . --no-build-isolation
  • Prerequisites: Python 3.12+, CUDA Toolkit 12.9+, NVCC 12.9+, PyTorch 2.9.1+ (matching CUDA version). Requires NVIDIA Hopper (SM90) or Blackwell (SM10X) GPUs.
  • Links: Repository: https://github.com/inclusionAI/cuLA. CuTe DSL: https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL.

Highlighted Details

  • Achieves significant speedups, e.g., KDA Modular Forward (Blackwell) averages 1.45x, Lightning Attention Prefill (Blackwell) up to 1.86x, and KDA Fused Forward (Hopper) averages 1.52x.
  • Designed as a drop-in submodule for flash-linear-attention (FLA), requiring only a one-line import change for integration.
  • Supports key linear attention variants including KDA (Kimi Delta Attention) and Lightning Attention.
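The delta-rule family that KDA and GDN belong to refines the purely additive state update: each step first erases what the state already associates with the incoming key, then writes the new value. A minimal numpy sketch, assuming the standard DeltaNet-style formulation (function and variable names here are illustrative, not cuLA's API):

```python
import numpy as np

# Hedged sketch (illustrative, not cuLA's API): the ungated delta-rule
# update underlying DeltaNet-style variants such as KDA and GDN. Rather
# than accumulating k v^T unconditionally, each step subtracts the state's
# current prediction for k before writing v, scaled by a learning rate beta.

def delta_rule_step(S, k, v, beta):
    # S: (d_k, d_v) associative state; its recall for key k is S.T @ k.
    pred = S.T @ k                        # what the state currently recalls
    return S + beta * np.outer(k, v - pred)

d_k = d_v = 4
S = np.zeros((d_k, d_v))
keys = np.eye(d_k)                        # orthonormal keys for illustration
vals = np.arange(d_k * d_v, dtype=float).reshape(d_k, d_v)
for i in range(d_k):
    S = delta_rule_step(S, keys[i], vals[i], beta=1.0)

# With beta = 1 and orthonormal keys, the state stores each value exactly.
assert np.allclose(S.T @ keys.T, vals.T)
```

Because the update erases before it writes, re-presenting an already-stored key/value pair leaves the state unchanged, which is what makes delta-rule variants better at overwriting stale associations than plain additive linear attention.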

Maintenance & Community

Contributions are actively welcomed for performance tuning, new algorithm support, and bug fixes. A Slack community is available for Q&A and discussion. The project roadmap indicates ongoing development towards full integration with FLA and further optimizations.

Licensing & Compatibility

The README does not explicitly state a software license, so suitability for commercial use or closed-source integration cannot be assessed until the authors clarify licensing.

Limitations & Caveats

cuLA is in its "Early Stage" of development, with many kernels requiring further optimization and the API subject to evolution. CUDA kernel tuning is noted as labor-intensive, highlighting a potential bottleneck for rapid development or broad community contribution.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 27
  • Issues (30d): 21
  • Star History: 420 stars in the last 10 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Philipp Schmid (DevRel at Google DeepMind), and 2 more.

SageAttention by thu-ml

0.5%
3k
Attention kernel for plug-and-play inference acceleration
Created 1 year ago
Updated 2 months ago
Starred by Mehdi Amini (Author of MLIR; Distinguished Engineer at NVIDIA), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 15 more.

flashinfer by flashinfer-ai

1.6%
5k
Kernel library for LLM serving
Created 2 years ago
Updated 1 day ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6%
23k
Fast, memory-efficient attention implementation
Created 3 years ago
Updated 3 days ago