Sparse attention primitives for research
This repository provides sparse attention primitives for efficiently processing long sequences in Transformer models, targeting researchers and engineers working on large-scale language generation. It offers optimized attention kernels that reduce computational complexity by skipping unnecessary calculations, enabling faster training and inference.
How It Works
The core of the project lies in fused implementations of attention operations that support block sparsity. Instead of computing the full attention matrix, it allows users to define patterns of blocks to be skipped (set to zero) in the QK^T product and softmax calculation. This approach, detailed in the Sparse Transformers paper, significantly reduces computation by avoiding unnecessary operations, especially for long sequences.
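As a rough illustration of the idea (this is a minimal NumPy sketch, not the repository's fused CUDA kernels, and the function and parameter names are invented for this example): QK^T is computed only for blocks that a user-supplied layout marks as active, and the skipped positions end up with zero softmax weight.

```python
# Illustrative block-sparse attention in NumPy. Blocks the layout marks 0 are
# never multiplied out; the fused kernels in this repo also skip them inside
# the softmax, while this sketch simply masks them to -inf.
import numpy as np

def block_sparse_attention(q, k, v, layout, block_size):
    """q, k, v: [seq_len, d]; layout: [n_blocks, n_blocks] 0/1 block mask."""
    seq_len, d = q.shape
    n_blocks = seq_len // block_size
    scores = np.full((seq_len, seq_len), -np.inf)
    for i in range(n_blocks):
        rows = slice(i * block_size, (i + 1) * block_size)
        for j in range(n_blocks):
            if layout[i, j]:  # compute QK^T only for active blocks
                cols = slice(j * block_size, (j + 1) * block_size)
                scores[rows, cols] = q[rows] @ k[cols].T / np.sqrt(d)
    # Row-wise softmax: masked (-inf) positions get exactly zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: a causal "local" layout where each block attends to itself and its predecessor.
seq_len, block_size, d = 8, 2, 4
n_blocks = seq_len // block_size
layout = np.zeros((n_blocks, n_blocks), dtype=int)
for i in range(n_blocks):
    layout[i, max(0, i - 1):i + 1] = 1
q = k = v = np.random.randn(seq_len, d)
out = block_sparse_attention(q, k, v, layout, block_size)
print(out.shape)  # (8, 4)
```

In the fused kernels, the same block layout lets the skipped blocks be omitted from both the matmul and the softmax entirely, which is where the speedup on long sequences comes from.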
Quick Start & Requirements
Install: pip install blocksparse (requires CUDA 10 and tensorflow-gpu).
Run: python attention.py (on non-V100 GPUs) or python attention.py fp16 (on V100 GPUs).
Maintenance & Community
Status: Archive (code provided as-is, no updates expected).
Licensing & Compatibility
The repository does not explicitly state a license. The associated paper and blog post are from OpenAI.
Limitations & Caveats
The project is archived and no longer maintained. The primary dependency, blocksparse, may require installation from source depending on the CUDA and TensorFlow setup. FP16 support is restricted to specific GPU hardware (V100).