Research paper implementation of random-access infinite-context Transformers (Landmark Attention)
This repository provides an implementation of Landmark Attention, a technique designed to let Transformers handle arbitrarily long contexts through random access to past information. It is targeted at researchers and engineers working with large language models who need to overcome the quadratic complexity of standard attention. The primary benefit is the ability to process significantly longer sequences with reduced computational and memory overhead.
How It Works
Landmark Attention inserts "landmark" tokens at regular intervals in the input sequence. These landmarks act as memory checkpoints: the model attends to the landmarks to retrieve only the relevant blocks of past context instead of attending to every past token. The repository includes a high-level implementation as well as a fused Triton kernel that combines landmark attention with Flash Attention for improved performance and reduced memory usage. Because attention is restricted to landmarks plus the retrieved blocks, complexity stays near-linear in sequence length, unlike the quadratic complexity of standard attention.
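For intuition, here is a minimal sketch (not the repository's API) of the preprocessing step described above: a landmark token is appended after every block of regular tokens. The token id and block size used here are illustrative assumptions, not values taken from the code.

```python
import torch

LANDMARK_TOKEN_ID = 32001  # hypothetical id reserved for the landmark token
BLOCK_SIZE = 64            # hypothetical number of regular tokens per block


def add_landmarks(input_ids: torch.Tensor, block_size: int = BLOCK_SIZE) -> torch.Tensor:
    """Append a landmark token at the end of every block of `block_size` tokens."""
    landmark = torch.tensor([LANDMARK_TOKEN_ID], dtype=input_ids.dtype)
    blocks = input_ids.split(block_size)
    # Each block becomes [regular tokens ..., landmark token].
    return torch.cat([torch.cat([block, landmark]) for block in blocks])


tokens = torch.arange(256)              # toy sequence of 256 "token ids"
augmented = add_landmarks(tokens)
print(augmented.shape)                  # 256 + 256 // 64 = 260 tokens
```

At attention time, a query first scores the landmark of each block; only the blocks whose landmarks score highly are retrieved for full token-level attention.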
Quick Start & Requirements
An install_deps.sh script is provided for installing dependencies, particularly for the Triton implementation, which is prone to PyTorch version conflicts.
Maintenance & Community
The project is by Amirkeivan Mohtashami and Martin Jaggi. The README notes ongoing work to update outdated component names in the code.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README.
Limitations & Caveats
The fused Triton implementation assumes landmark blocks have the same size as Flash Attention blocks, so the maximum block size is limited by what fits in GPU local memory. It also assumes that the difference between the number of keys and the number of queries is a multiple of the block size, so token-by-token auto-regressive generation must fall back to standard attention (see the sketch below). The high-level implementation allows flexible landmark placement, whereas the fused version assumes landmarks are placed regularly at the end of each block.
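To make the key/query constraint concrete, the following sketch shows how a caller might decide between the fused kernel and standard attention. This is an assumption about usage for illustration only, not code from the repository.

```python
def can_use_fused_kernel(num_queries: int, num_keys: int, block_size: int) -> bool:
    """Fused path only applies when the extra keys form whole blocks."""
    return (num_keys - num_queries) % block_size == 0


BLOCK_SIZE = 64  # hypothetical block size

# Prefill of a 2048-token prompt: queries and keys match, fused path applies.
assert can_use_fused_kernel(2048, 2048, BLOCK_SIZE)

# Decoding one token against a cache of 2049 keys: 2048 extra keys is a
# multiple of 64, but the next step (2049 extra keys) is not, so standard
# attention is needed for token-by-token generation in general.
assert can_use_fused_kernel(1, 2049, BLOCK_SIZE)
assert not can_use_fused_kernel(1, 2050, BLOCK_SIZE)
```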