Star-Attention by NVIDIA

PyTorch code for efficient LLM inference on long sequences

created 8 months ago
385 stars

Top 75.5% on sourcepulse

View on GitHub
Project Summary

This repository provides an implementation of Star Attention, a novel block-sparse attention mechanism designed for efficient inference of Large Language Models (LLMs) over long sequences. It targets researchers and engineers working with LLMs who need to process extended contexts without significant accuracy degradation or computational overhead. Star Attention offers up to 11x speedup with minimal accuracy loss, making it suitable for applications requiring long-context understanding.

How It Works

Star Attention optimizes attention computation in two phases. Phase 1, Context Encoding, segments the context into contiguous blocks, prefixes each block after the first with the first block (the anchor block), and processes these augmented blocks in parallel across distributed hosts. Phase 2, Query Processing and Token Generation, has a query host gather the local attention outputs and softmax denominators from all hosts and combine them into the global attention output. This keeps communication overhead low: each host exchanges only its local attention result and a single scalar (the sum of exponents from its local softmax).
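The Phase 2 merge is the part that can be written down compactly, so a brief sketch may help. The snippet below is a minimal PyTorch illustration, not the repository's implementation; the function names and tensor shapes are invented for the example. It shows how locally normalized attention outputs, weighted by each host's sum of exponents, recombine into the global attention output.

    import torch

    def local_query_attention(q, k_cache, v_cache):
        """Query attention over one host's local KV cache.
        q: [heads, q_len, dim]; k_cache, v_cache: [heads, ctx_len, dim]."""
        scores = q @ k_cache.transpose(-1, -2) / q.shape[-1] ** 0.5
        exp_scores = torch.exp(scores)            # real code would work in log space for stability
        sum_exp = exp_scores.sum(dim=-1)          # [heads, q_len]: the scalar(s) each host ships
        out = (exp_scores / sum_exp.unsqueeze(-1)) @ v_cache
        return out, sum_exp

    def merge_global_attention(local_outputs, local_sum_exp):
        """Combine per-host results on the query host into the global output.
        local_outputs: list of [heads, q_len, dim]; local_sum_exp: list of [heads, q_len]."""
        s = torch.stack(local_sum_exp)            # [hosts, heads, q_len]
        w = s / s.sum(dim=0, keepdim=True)        # each host's share of the global softmax
        return (w.unsqueeze(-1) * torch.stack(local_outputs)).sum(dim=0)

    # Toy usage: three "hosts", each holding one slice of the context's KV cache.
    heads, q_len, dim = 4, 1, 16
    q = torch.randn(heads, q_len, dim)
    kv_slices = [(torch.randn(heads, 128, dim), torch.randn(heads, 128, dim)) for _ in range(3)]
    results = [local_query_attention(q, k, v) for k, v in kv_slices]
    global_out = merge_global_attention([o for o, _ in results], [s for _, s in results])

Because the per-host weights sum to one, this reproduces the softmax attention the query would compute over the full concatenated KV cache, while only the local outputs and sums of exponents cross the network.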

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Download NLTK tokenizer: import nltk; nltk.download('punkt_tab')
  • Download datasets (Paul Graham Essays, SQuAD, HotpotQA) via bash ruler/download_data.sh.
  • Download models using scripts/download_hf_model.py.
  • Run inference on RULER (see the example invocation after this list): python run_ruler.py -n <experiment_name> -p <path_to_model> -pc <prompt_template_type> -a star -bs <context_block_size> -l <list_of_sequence_lengths_to_run_inference> -np <num_parallel_processes_per_node> --output_dir <output_directory>
  • Supports multi-node inference with -nn <num_nodes>.
  • Official documentation and examples are available within the repository.
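For concreteness, here is a hedged end-to-end example of the RULER run above, wrapped in Python for scripting. Only the flag names come from the command shown in the list; the experiment name, model path, prompt template, block size, sequence length, and process count are hypothetical placeholders, not values recommended by the project.

    import subprocess

    # Hypothetical values throughout; substitute your own paths and settings.
    subprocess.run([
        "python", "run_ruler.py",
        "-n", "star_demo",                 # experiment name (hypothetical)
        "-p", "models/my-hf-model",        # model path from scripts/download_hf_model.py (hypothetical)
        "-pc", "llama3",                   # prompt template type (hypothetical)
        "-a", "star",                      # use the Star Attention algorithm
        "-bs", "16384",                    # context block size (hypothetical)
        "-l", "65536",                     # sequence length(s) to evaluate (hypothetical)
        "-np", "8",                        # parallel processes per node (hypothetical)
        "--output_dir", "results/",
    ], check=True)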

Highlighted Details

  • Achieves up to 11x inference speedup while preserving 97-100% of the accuracy of global attention.
  • Compatible with existing Transformer LLMs trained with global attention, requiring no additional training.
  • Orthogonal to other optimizations like Flash Attention and KV cache compression.
  • Implemented in PyTorch using HuggingFace Transformers.

Maintenance & Community

  • Developed by NVIDIA.
  • Issues can be raised in the repository for support or bug reporting.

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify any limitations or known bugs. The project appears to be research-oriented, and its stability and long-term maintenance are not detailed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

16 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

LongLoRA: Efficient fine-tuning for long-context LLMs

created 1 year ago, updated 11 months ago
3k stars

Top 0.1% on sourcepulse