PyTorch code for efficient LLM inference on long sequences
This repository provides an implementation of Star Attention, a novel block-sparse attention mechanism designed for efficient inference of Large Language Models (LLMs) over long sequences. It targets researchers and engineers who need to process extended contexts without prohibitive computational cost. Star Attention delivers up to an 11x speedup with minimal accuracy loss, making it suitable for applications that require long-context understanding.
How It Works
Star Attention computes attention in two phases. In Phase 1, Context Encoding, the context is split into blocks, each prefixed with an anchor block, and the augmented blocks are processed in parallel across distributed hosts. In Phase 2, Query Processing and Token Generation, a designated query host gathers the local attention outputs and softmax denominators from all hosts and combines them into the global attention output. Communication overhead stays minimal because each host exchanges only its local attention result and a single scalar (the sum of exponents).
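The Phase 2 merge follows directly from softmax algebra: each host's local output is reweighted by its share of the global softmax mass. Below is a minimal PyTorch sketch of both phases; the function and tensor names are illustrative assumptions rather than the repository's actual API, and each host is assumed to report the log-sum-exp of its local attention scores alongside its output.

import torch

def build_anchored_blocks(context_ids, block_size):
    # Phase 1: split the context into contiguous blocks, then prefix every
    # non-initial block with the anchor block (the first block) so each host
    # sees the start of the sequence during parallel encoding.
    blocks = [context_ids[i:i + block_size]
              for i in range(0, len(context_ids), block_size)]
    return [blocks[0]] + [blocks[0] + b for b in blocks[1:]]

def combine_local_attention(local_outputs, local_lse):
    # Phase 2: merge per-host results into the global attention output.
    #   local_outputs: [num_hosts, num_tokens, head_dim] local attention outputs
    #   local_lse:     [num_hosts, num_tokens] log-sum-exp of local scores
    global_lse = torch.logsumexp(local_lse, dim=0)             # [num_tokens]
    weights = torch.exp(local_lse - global_lse).unsqueeze(-1)  # per-host softmax mass
    return (weights * local_outputs).sum(dim=0)                # [num_tokens, head_dim]

Because the weights are derived from log-sum-exp values, the merge is numerically stable and reproduces the result of a single global softmax over all blocks.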
Quick Start & Requirements
pip install -r requirements.txt
python -c "import nltk; nltk.download('punkt_tab')"
bash ruler/download_data.sh
python scripts/download_hf_model.py
python run_ruler.py -n <experiment_name> -p <path_to_model> -pc <prompt_template_type> -a star -bs <context_block_size> -l <list_of_sequence_lengths_to_run_inference> -np <num_parallel_processes_per_node> --output_dir <output_directory>
For multi-node runs, additionally pass: -nn <num_nodes>
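For illustration, a filled-in single-node invocation might look like the following; every value here (experiment name, model path, prompt template, block size, sequence length, process count) is a hypothetical placeholder, not a setting from the README:
python run_ruler.py -n star_demo -p models/Llama-3.1-8B-Instruct -pc llama3 -a star -bs 4096 -l 16384 -np 8 --output_dir results/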
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify any limitations or known bugs. The project appears to be research-oriented, and its stability and long-term maintenance are not detailed.