native-sparse-attention-pytorch by lucidrains

Implementation of the sparse attention mechanism from DeepSeek's "Native Sparse Attention" research paper

created 5 months ago
684 stars

Top 50.6% on sourcepulse

View on GitHub
Project Summary

This repository provides a PyTorch implementation of the "Native Sparse Attention" pattern, designed to improve the efficiency of Transformer models. It targets researchers and engineers working with large language models or sequence processing tasks who need to reduce the quadratic complexity of standard self-attention.

How It Works

The implementation follows the paper's three-branch sparse attention design: a compression branch that attends over coarse-grained block summaries, a selection branch that attends over a small number of fine-grained token blocks chosen per query, and a sliding-window branch that preserves local context, with the branches combined through a learned gate. Because attention scores are computed only for this reduced set of token pairs rather than all possible pairs, long sequences can be processed at a fraction of the cost of full quadratic self-attention without significant quality degradation.
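To make the selection idea concrete, here is a minimal toy sketch in plain PyTorch (not the library's code): queries score mean-pooled key blocks, keep the top-k blocks, and attend only over the tokens inside them. The function name, shapes, and defaults are illustrative assumptions.

```python
import torch

def block_sparse_attention(q, k, v, block_size=4, num_selected=2):
    """Toy block-selection attention; illustrates the idea, not the library's kernels."""
    b, h, n, d = q.shape                      # seq_len n must be divisible by block_size
    num_blocks = n // block_size

    # 1. Compress each key block into one representative via mean pooling
    k_blk = k.view(b, h, num_blocks, block_size, d)
    v_blk = v.view(b, h, num_blocks, block_size, d)
    k_summary = k_blk.mean(dim=3)             # (b, h, num_blocks, d)

    # 2. Score every query against the block summaries; keep the top-k blocks per query
    scores = torch.einsum('bhid,bhjd->bhij', q, k_summary) * d ** -0.5
    top = scores.topk(num_selected, dim=-1).indices            # (b, h, n, num_selected)

    # 3. Gather the full keys/values of only the selected blocks
    bi = torch.arange(b).view(b, 1, 1, 1)
    hi = torch.arange(h).view(1, h, 1, 1)
    k_sel = k_blk[bi, hi, top].flatten(3, 4)  # (b, h, n, num_selected * block_size, d)
    v_sel = v_blk[bi, hi, top].flatten(3, 4)

    # 4. Attend only over the gathered subset instead of all n tokens
    attn = torch.einsum('bhid,bhijd->bhij', q, k_sel).mul(d ** -0.5).softmax(dim=-1)
    return torch.einsum('bhij,bhijd->bhid', attn, v_sel)

out = block_sparse_attention(torch.randn(1, 2, 16, 8),
                             torch.randn(1, 2, 16, 8),
                             torch.randn(1, 2, 16, 8))
```

In the actual module this selection branch runs alongside the compression and sliding-window branches rather than on its own.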

Quick Start & Requirements

  • Install: pip install native-sparse-attention-pytorch
  • Prerequisites: PyTorch; the Enwik8 training example additionally requires wandb.
  • Example: the README includes a usage example (see the sketch after this list) and instructions for running an Enwik8 language modeling experiment.
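
A minimal usage sketch based on the README's example; the SparseAttention constructor arguments shown here (dim, dim_head, heads, sliding_window_size, compress_block_size, selection_block_size, num_selected_blocks) are taken from that example and may change between versions, so treat this as a starting point rather than the authoritative API:

```python
import torch
from native_sparse_attention_pytorch import SparseAttention

# sparsity knobs: local window size, compression block size,
# and how many fine-grained blocks each query may select
attn = SparseAttention(
    dim = 512,
    dim_head = 64,
    heads = 8,
    sliding_window_size = 2,
    compress_block_size = 4,
    selection_block_size = 4,
    num_selected_blocks = 2
)

tokens = torch.randn(2, 31, 512)   # (batch, seq_len, dim)

attended = attn(tokens)            # output has the same shape as the input

assert attended.shape == tokens.shape
```

The three sparsity parameters in this snippet correspond to the configurable options listed under Highlighted Details below.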

Highlighted Details

  • Implements the "Native Sparse Attention" pattern from the DeepSeek team.
  • Offers configurable parameters for sparsity, including sliding_window_size, compress_block_size, and num_selected_blocks.
  • Includes an example for language modeling on Enwik8.
  • Cites relevant research papers on sparse attention and computational complexity.

Maintenance & Community

The project is maintained by lucidrains, with contributions acknowledged from Phil Tillet, Mr-Grin, and Eric Pasewark. No specific community channels (Discord, Slack) are listed.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. This requires further investigation for commercial use or integration into closed-source projects.

Limitations & Caveats

The README does not specify the exact performance gains or benchmarks compared to standard attention or other sparse attention implementations. The absence of an explicit license is a significant caveat for adoption.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 96 stars in the last 90 days
