Sparse attention implementation from DeepSeek's research paper
This repository provides a PyTorch implementation of the "Native Sparse Attention" pattern, designed to improve the efficiency of Transformer models. It targets researchers and engineers working with large language models or sequence processing tasks who need to reduce the quadratic complexity of standard self-attention.
How It Works
The implementation uses a sparse attention mechanism that avoids scoring every query against every key: each query attends to a local sliding window, to coarse compressed summaries of key/value blocks, and to a small number of fine-grained blocks selected using those coarse scores. Because attention is computed only for this subset of token pairs rather than all possible pairs, the cost falls well below the quadratic cost of dense self-attention, making longer sequences tractable without significant quality degradation.
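To make the "subset of token pairs" idea concrete, here is a deliberately simplified, non-causal toy sketch (not the library's code): each query attends to a local sliding window plus the keys of a few blocks chosen by a coarse scoring pass over mean-pooled block summaries.

```python
import torch
import torch.nn.functional as F

def toy_block_sparse_attention(q, k, v, block_size=16, num_selected_blocks=2, window=32):
    # q, k, v: (seq_len, dim); seq_len assumed divisible by block_size for brevity.
    n, d = q.shape
    scale = d ** -0.5
    num_blocks = n // block_size

    # Coarse pass: score each query against mean-pooled block summaries of the keys,
    # then keep only the top-scoring blocks for the fine-grained pass.
    block_summaries = k.view(num_blocks, block_size, d).mean(dim=1)   # (num_blocks, dim)
    coarse = (q @ block_summaries.T) * scale                          # (seq_len, num_blocks)
    selected = coarse.topk(min(num_selected_blocks, num_blocks), dim=-1).indices

    out = torch.empty_like(q)
    for i in range(n):
        # Keys from the selected blocks ...
        fine = (selected[i, :, None] * block_size + torch.arange(block_size)).flatten()
        # ... plus a local sliding window (causal masking omitted for brevity).
        local = torch.arange(max(0, i - window), i + 1)
        idx = torch.unique(torch.cat([fine, local]))

        attn = F.softmax((q[i] @ k[idx].T) * scale, dim=-1)           # scores over kept keys only
        out[i] = attn @ v[idx]
    return out

# Example: 256 tokens, 32-dim head; each query scores far fewer than 256 keys.
q = k = v = torch.randn(256, 32)
print(toy_block_sparse_attention(q, k, v).shape)   # torch.Size([256, 32])
```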
Quick Start & Requirements
pip install native-sparse-attention-pytorch
The wandb package is an additional requirement, used for experiment logging when running the training example.
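A minimal usage sketch follows. Only the three sparsity hyperparameters listed under Highlighted Details are named in this summary; the SparseAttention class name, the remaining constructor arguments, and the example values are assumptions about the package interface, so consult the repository README for the exact signature.

```python
import torch
from native_sparse_attention_pytorch import SparseAttention  # assumed import path

attn = SparseAttention(
    dim = 512,                   # model dimension (assumed argument name)
    dim_head = 64,               # per-head dimension (assumed argument name)
    heads = 8,                   # number of attention heads (assumed argument name)
    sliding_window_size = 64,    # local window every query attends to
    compress_block_size = 16,    # size of the coarse, compressed key/value blocks
    num_selected_blocks = 4,     # fine-grained blocks each query keeps
)

tokens = torch.randn(1, 1024, 512)   # (batch, sequence, dim)
out = attn(tokens)                   # output shape matches the input
```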
Highlighted Details
The sparsity pattern is controlled by a handful of hyperparameters, notably sliding_window_size, compress_block_size, and num_selected_blocks.
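As a rough, illustrative budget (assuming, for simplicity, that fine-grained selection happens at the compress_block_size granularity; the values below are made up), these three knobs bound how many keys each query actually scores:

```python
seq_len             = 8192   # illustrative sequence length
sliding_window_size = 64     # local tokens every query attends to
compress_block_size = 16     # granularity of the compressed key/value blocks
num_selected_blocks = 4      # fine-grained blocks each query keeps

compressed_summaries = seq_len // compress_block_size             # coarse summaries scored per query
selected_tokens      = num_selected_blocks * compress_block_size  # tokens inside the kept blocks
keys_per_query       = sliding_window_size + compressed_summaries + selected_tokens

print(keys_per_query, "vs", seq_len)   # 640 vs 8192 keys for dense attention
```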
Maintenance & Community
The project is maintained by lucidrains, with contributions acknowledged from Phil Tillet, Mr-Grin, and Eric Pasewark. No specific community channels (Discord, Slack) are listed.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README, so licensing should be confirmed before commercial use or integration into closed-source projects.
Limitations & Caveats
The README does not specify the exact performance gains or benchmarks compared to standard attention or other sparse attention implementations. The absence of an explicit license is a significant caveat for adoption.