landmark-attention by epfml

Research paper implementation for random-access infinite context Transformers

created 2 years ago
423 stars

Top 70.7% on sourcepulse

Project Summary

This repository provides an implementation of Landmark Attention, a technique that lets Transformers handle effectively unbounded context lengths through random access to past information. It is aimed at researchers and engineers working with large language models who need to get around the quadratic cost of standard attention. The primary benefit is the ability to process significantly longer sequences with reduced computational and memory overhead.

How It Works

Landmark Attention inserts "landmark" tokens at regular intervals in the input sequence. Each landmark summarizes its block of tokens and acts as a memory checkpoint: queries first score the landmarks to identify relevant blocks of past context, then retrieve and attend to only those blocks rather than every previous token. The repository provides both a high-level implementation and a fused Triton kernel combined with Flash Attention for better performance and lower memory usage. Because each query attends to the landmarks plus a small number of retrieved blocks, the cost grows far more slowly with sequence length than the quadratic cost of standard attention.
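To make the retrieval idea concrete, below is a minimal PyTorch sketch of landmark-style block retrieval. It is illustrative only: the function name, the use of a block-mean key as the landmark representative (the actual method trains a dedicated landmark token and a grouped softmax), and the block_size / top_k parameters are assumptions for this summary, not the repository's API.

```python
# Illustrative sketch (not the repository's implementation): each block of keys
# is represented by one "landmark" vector; queries score the landmarks, keep the
# top-k blocks, and attend only within those retrieved blocks.
import torch
import torch.nn.functional as F


def landmark_retrieval_attention(q, k, v, block_size=64, top_k=4):
    """q: (n_queries, dim); k, v: (seq_len, dim). Assumes seq_len >= block_size."""
    seq_len, dim = k.shape
    n_blocks = seq_len // block_size
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, dim)
    v_blocks = v[: n_blocks * block_size].view(n_blocks, block_size, dim)

    # Landmark representative per block; the real method learns a landmark token,
    # a block mean is used here purely for illustration.
    landmarks = k_blocks.mean(dim=1)                      # (n_blocks, dim)

    # Score blocks via their landmarks and keep the top-k per query.
    block_scores = q @ landmarks.T / dim ** 0.5           # (n_queries, n_blocks)
    top_blocks = block_scores.topk(min(top_k, n_blocks), dim=-1).indices

    out = torch.empty_like(q)
    for i, blocks in enumerate(top_blocks):               # per-query retrieval
        keys = k_blocks[blocks].reshape(-1, dim)          # (top_k * block_size, dim)
        vals = v_blocks[blocks].reshape(-1, dim)
        attn = F.softmax(q[i] @ keys.T / dim ** 0.5, dim=-1)
        out[i] = attn @ vals
    return out
```

The hard top-k retrieval above is a simplification: in the paper, landmark and token scores are combined through a grouped softmax during training, and the fused Triton kernel interleaves this with Flash Attention's tiling rather than materializing retrieved blocks separately.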

Quick Start & Requirements

  • Installation: An install_deps.sh script is provided for installing dependencies, in particular for the fused Triton implementation, whose required Triton version can conflict with the installed PyTorch (a quick version check is sketched after this list).
  • Prerequisites: Python, PyTorch, Triton (a specific version is required), and a CUDA-capable GPU for the fused Triton/Flash Attention path. LLaMA fine-tuning requires LLaMA weights in Hugging Face format.
  • Resources: Language-modeling experiments use large datasets such as PG19 and arXiv Math; fine-tuning LLaMA 7B with landmark attention is also demonstrated.
  • Links: Paper: https://arxiv.org/abs/2305.16300
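Because the fused path is sensitive to the PyTorch/Triton pairing, a quick sanity check of the environment can save time before (or after) running install_deps.sh. The snippet below is a generic check written for this summary, not part of the repository.

```python
# Generic environment check (not part of the repository): confirms the
# PyTorch / Triton / CUDA pieces the fused implementation relies on.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import triton
    print("Triton:", triton.__version__)
except ImportError:
    print("Triton missing -- run install_deps.sh to install the pinned version")
```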

Highlighted Details

  • Fused Triton implementation with Flash Attention significantly reduces memory usage and increases performance.
  • Demonstrates fine-tuning LLaMA 7B with landmark attention so that inference can use contexts well beyond the model's original 2048-token window.
  • Released weight diff for LLaMA 7B fine-tuned on RedPajama with landmark attention.
  • Supports language modeling benchmarks on PG19 and arXiv Math datasets.

Maintenance & Community

The project is associated with Amirkeivan Mohtashami and Martin Jaggi. The README mentions ongoing work to address outdated component names within the code.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README.

Limitations & Caveats

The fused Triton implementation assumes landmark blocks are the same size as Flash Attention blocks, which caps the maximum block size at what fits in GPU local memory. It also assumes the difference between the number of keys and the number of queries is a multiple of the block size, so standard attention must be used for token-by-token auto-regressive generation. The high-level implementation allows flexible landmark placement, whereas the fused version assumes landmarks are placed regularly at the end of each block.
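As a concrete illustration of the second constraint, a caller could dispatch between the fused kernel and standard attention along these lines; the function and path names below are placeholders invented for this summary, not the repository's API.

```python
# Hypothetical dispatch illustrating the caveat above (placeholder names only).
def choose_attention_path(num_queries: int, num_keys: int, block_size: int) -> str:
    """The fused kernel assumes (num_keys - num_queries) is a multiple of the
    landmark/Flash block size; otherwise fall back to standard attention,
    as during token-by-token auto-regressive decoding (num_queries == 1)."""
    if (num_keys - num_queries) % block_size == 0:
        return "fused_landmark_flash"   # placeholder for the fused Triton path
    return "standard_attention"         # fallback for incremental decoding
```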

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

1 star in the last 90 days
