inclusionAI/cuLA: CUDA kernels for efficient linear attention
Top 70.1% on SourcePulse
This project provides high-performance CUDA kernels for linear attention variants, addressing the quadratic complexity of standard attention mechanisms in Large Language Models (LLMs). It targets researchers and engineers developing long-context LLM applications, offering significant speedups on modern NVIDIA GPUs by enabling linear-time state updates.
How It Works
cuLA implements hand-tuned CUDA kernels for linear attention variants such as GLA, KDA, GDN, and Lightning Attention, written with the CuTe DSL and CUTLASS C++. Instead of computing quadratic pairwise interactions between all tokens, these kernels maintain a recurrent state that is updated in linear time per token, which makes them well suited to processing very long contexts. The kernels are specifically optimized for NVIDIA Blackwell (SM10X) and Hopper (SM90) GPU architectures.
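To make the quadratic-vs-linear distinction concrete, here is a minimal NumPy sketch (not cuLA's actual kernels, which are hand-tuned CUDA) showing that causal linear attention computed via O(T²) pairwise scores equals the O(T) recurrent formulation with a per-step state update:

```python
import numpy as np

def quadratic_linear_attention(q, k, v):
    # O(T^2): materialize all pairwise scores, then apply a causal mask.
    # Plain (unnormalized) linear attention, i.e. no softmax.
    T = q.shape[0]
    scores = q @ k.T                  # (T, T) pairwise interactions
    mask = np.tril(np.ones((T, T)))   # causal mask: token t sees s <= t
    return (scores * mask) @ v

def recurrent_linear_attention(q, k, v):
    # O(T): carry a (d_k, d_v) state S, updated once per token.
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((q.shape[0], d_v))
    for t in range(q.shape[0]):
        S = S + np.outer(k[t], v[t])  # linear-time state update
        out[t] = q[t] @ S             # readout against the running state
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
assert np.allclose(quadratic_linear_attention(q, k, v),
                   recurrent_linear_attention(q, k, v))
```

Variants like GLA add a learned decay (gate) to the state update, but the linear-time structure is the same; cuLA's contribution is making these updates fast on real hardware.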
Quick Start & Requirements
Clone the repository and initialize its submodules (git submodule update --init --recursive), install PyTorch (pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu129), install flash-linear-attention (pip install -e third_party/flash-linear-attention), and finally install cuLA itself (pip install -e . --no-build-isolation). Repository: https://github.com/inclusionAI/cuLA. CuTe DSL: https://github.com/NVIDIA/cutlass/tree/main/python/CuTeDSL.
Highlighted Details
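The steps above can be collected into a single setup script. This is a sketch assuming the usual clone-then-submodule flow and the repository layout implied by the install commands:

```shell
# Sketch of an end-to-end setup; paths follow the commands quoted above.
git clone https://github.com/inclusionAI/cuLA.git
cd cuLA
git submodule update --init --recursive

# PyTorch build matching CUDA 12.9
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu129

# Editable install of the bundled flash-linear-attention, then cuLA itself
pip install -e third_party/flash-linear-attention
pip install -e . --no-build-isolation
```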
cuLA is designed to slot in behind flash-linear-attention (FLA), requiring only a one-line import change for integration.
Maintenance & Community
Contributions are actively welcomed for performance tuning, new algorithm support, and bug fixes. A Slack community is available for Q&A and discussion. The project roadmap indicates ongoing development towards full integration with FLA and further optimizations.
Licensing & Compatibility
The README does not explicitly state a software license. This omission requires clarification for assessing commercial use or closed-source integration compatibility.
Limitations & Caveats
cuLA is in its "Early Stage" of development, with many kernels requiring further optimization and the API subject to evolution. CUDA kernel tuning is noted as labor-intensive, highlighting a potential bottleneck for rapid development or broad community contribution.