Efficient Torch/Triton implementations for linear attention models
This repository provides efficient Triton-based implementations of state-of-the-art linear attention models for researchers and developers working with large language models. It offers optimized kernels and model integrations for various linear attention architectures, aiming to improve training and inference speed.
How It Works
The project leverages Python, PyTorch, and Triton to implement hardware-efficient kernels for linear attention mechanisms. It focuses on optimizing computations like fused operations (e.g., norm layers with gating, linear layers with cross-entropy loss) and parallelization strategies to reduce memory usage and increase throughput. The use of Triton allows for fine-grained control over GPU execution, enabling significant performance gains over standard PyTorch implementations.
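To illustrate the computation these kernels accelerate, here is a minimal pure-PyTorch reference of the causal linear attention recurrence. This is an illustrative sketch, not the library's implementation; function and tensor names are made up for the example.

```python
import torch

def linear_attention_reference(q, k, v):
    """Naive recurrent form of causal linear attention.

    q, k, v: (batch, heads, seq_len, head_dim)
    Each step updates a (head_dim x head_dim) state with the outer product
    k_t^T v_t and reads it out with the current query.
    """
    b, h, L, d = q.shape
    state = q.new_zeros(b, h, d, d)           # running sum of outer products
    outputs = []
    for t in range(L):
        k_t = k[:, :, t]                       # (b, h, d)
        v_t = v[:, :, t]                       # (b, h, d)
        state = state + k_t.unsqueeze(-1) * v_t.unsqueeze(-2)   # rank-1 state update
        o_t = torch.einsum('bhd,bhde->bhe', q[:, :, t], state)  # query the state
        outputs.append(o_t)
    return torch.stack(outputs, dim=2)         # (b, h, L, d)

# Small shapes for a quick CPU sanity check
q = torch.randn(2, 4, 128, 64)
k = torch.randn(2, 4, 128, 64)
v = torch.randn(2, 4, 128, 64)
print(linear_attention_reference(q, k, v).shape)  # torch.Size([2, 4, 128, 64])
```

The Triton kernels target exactly this kind of recurrence, replacing the sequential Python loop with chunkwise-parallel scans and fusing surrounding operations such as normalization and gating, which is where the memory and throughput gains come from.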
Quick Start & Requirements
pip install --no-use-pep517 flash-linear-attention
or install from source for the latest features.
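After installation, the layers are used as regular PyTorch modules. The sketch below follows the usage pattern from the project's documentation, but the module path fla.layers, the MultiScaleRetention class, and its constructor arguments are assumptions that should be checked against the installed version.

```python
import torch
from fla.layers import MultiScaleRetention  # assumed module path; verify against the installed package

batch_size, num_heads, seq_len, hidden_size = 8, 4, 2048, 1024
device, dtype = 'cuda', torch.bfloat16

# Build a single RetNet-style linear-attention layer and a random input batch.
layer = MultiScaleRetention(hidden_size=hidden_size, num_heads=num_heads).to(device=device, dtype=dtype)
x = torch.randn(batch_size, seq_len, hidden_size, device=device, dtype=dtype)

# The layer may return auxiliary outputs (e.g. cached recurrent state) alongside the hidden states.
y, *_ = layer(x)
print(y.shape)  # expected: (batch_size, seq_len, hidden_size)
```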
Highlighted Details

Trained models can be evaluated on downstream tasks via lm-evaluation-harness.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats