Star-Attention by NVIDIA

PyTorch code for efficient LLM inference on long sequences

created 8 months ago
385 stars

Top 75.5% on sourcepulse

View on GitHub
Project Summary

This repository provides an implementation of Star Attention, a novel block-sparse attention mechanism designed for efficient inference of Large Language Models (LLMs) over long sequences. It targets researchers and engineers working with LLMs who need to process extended contexts without significant accuracy degradation or computational overhead. Star Attention offers up to 11x speedup with minimal accuracy loss, making it suitable for applications requiring long-context understanding.

How It Works

Star Attention optimizes attention computation in two phases. Phase 1, Context Encoding, segments the context into contiguous blocks, prefixes each block after the first with the first block (the anchor block), and processes these augmented blocks in parallel across distributed hosts. Phase 2, Query Processing and Token Generation, has a query host gather the local attention outputs and softmax denominators from all hosts and combine them into the global attention output. This keeps communication overhead low: each host exchanges only its local attention result and a single scalar (the sum of exponents from its local softmax).
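The Phase 2 merge is the part that can be written down compactly, so a brief sketch may help. The snippet below is a minimal PyTorch illustration, not the repository's implementation; the function names and tensor shapes are invented for the example. It shows how locally normalized attention outputs, weighted by each host's sum of exponents, recombine into the global attention output.

    import torch

    def local_query_attention(q, k_cache, v_cache):
        """Query attention over one host's local KV cache.
        q: [heads, q_len, dim]; k_cache, v_cache: [heads, ctx_len, dim]."""
        scores = q @ k_cache.transpose(-1, -2) / q.shape[-1] ** 0.5
        exp_scores = torch.exp(scores)            # real code would work in log space for stability
        sum_exp = exp_scores.sum(dim=-1)          # [heads, q_len]: the scalar(s) each host ships
        out = (exp_scores / sum_exp.unsqueeze(-1)) @ v_cache
        return out, sum_exp

    def merge_global_attention(local_outputs, local_sum_exp):
        """Combine per-host results on the query host into the global output.
        local_outputs: list of [heads, q_len, dim]; local_sum_exp: list of [heads, q_len]."""
        s = torch.stack(local_sum_exp)            # [hosts, heads, q_len]
        w = s / s.sum(dim=0, keepdim=True)        # each host's share of the global softmax
        return (w.unsqueeze(-1) * torch.stack(local_outputs)).sum(dim=0)

    # Toy usage: three "hosts", each holding one slice of the context's KV cache.
    heads, q_len, dim = 4, 1, 16
    q = torch.randn(heads, q_len, dim)
    kv_slices = [(torch.randn(heads, 128, dim), torch.randn(heads, 128, dim)) for _ in range(3)]
    results = [local_query_attention(q, k, v) for k, v in kv_slices]
    global_out = merge_global_attention([o for o, _ in results], [s for _, s in results])

Because the per-host weights sum to one, this reproduces the softmax attention the query would compute over the full concatenated KV cache, while only the local outputs and sums of exponents cross the network.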

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt
  • Download NLTK tokenizer: import nltk; nltk.download('punkt_tab')
  • Download datasets (Paul Graham Essays, SQuAD, HotpotQA) via bash ruler/download_data.sh.
  • Download models using scripts/download_hf_model.py.
  • Run inference on RULER (see the example invocation after this list): python run_ruler.py -n <experiment_name> -p <path_to_model> -pc <prompt_template_type> -a star -bs <context_block_size> -l <list_of_sequence_lengths_to_run_inference> -np <num_parallel_processes_per_node> --output_dir <output_directory>
  • Supports multi-node inference with -nn <num_nodes>.
  • Official documentation and examples are available within the repository.
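For concreteness, here is a hedged end-to-end example of the RULER run above, wrapped in Python for scripting. Only the flag names come from the command shown in the list; the experiment name, model path, prompt template, block size, sequence length, and process count are hypothetical placeholders, not values recommended by the project.

    import subprocess

    # Hypothetical values throughout; substitute your own paths and settings.
    subprocess.run([
        "python", "run_ruler.py",
        "-n", "star_demo",                 # experiment name (hypothetical)
        "-p", "models/my-hf-model",        # model path from scripts/download_hf_model.py (hypothetical)
        "-pc", "llama3",                   # prompt template type (hypothetical)
        "-a", "star",                      # use the Star Attention algorithm
        "-bs", "16384",                    # context block size (hypothetical)
        "-l", "65536",                     # sequence length(s) to evaluate (hypothetical)
        "-np", "8",                        # parallel processes per node (hypothetical)
        "--output_dir", "results/",
    ], check=True)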

Highlighted Details

  • Achieves up to 11x inference speedup while preserving 97-100% of the accuracy of global attention.
  • Compatible with existing Transformer LLMs trained with global attention, requiring no additional training.
  • Orthogonal to other optimizations like Flash Attention and KV cache compression.
  • Implemented in PyTorch using HuggingFace Transformers.

Maintenance & Community

  • Developed by NVIDIA.
  • Issues can be raised in the repository for support or bug reporting.

Licensing & Compatibility

  • The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify any limitations or known bugs. The project appears to be research-oriented, and its stability and long-term maintenance are not detailed.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

16 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

LongLoRA: Efficient fine-tuning for long-context LLMs

created 1 year ago, updated 11 months ago
3k stars

Top 0.1% on sourcepulse