attention_sinks by tomaarsen

LLM extension for fluent, endless text generation

created 1 year ago
702 stars

Top 49.6% on sourcepulse

View on GitHub
Project Summary

This repository provides an open-source implementation of "Attention Sinks," a technique to enable Large Language Models (LLMs) to generate fluent text indefinitely without retraining, using constant memory. It's ideal for multi-step LLM applications like chat assistants, maintaining performance and fluency over extended generation lengths.

How It Works

Attention Sinks modifies the model's key-value cache so that it always keeps a small, fixed number of initial "sink" tokens together with a sliding window of the most recent tokens, evicting everything in between. The initial tokens receive a disproportionate share of attention, so retaining them preserves the fluency that a plain sliding window loses once they are evicted, while the bounded cache avoids the ever-growing memory use of standard full attention. The result is fluent, long-form generation with constant VRAM usage.
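
The eviction rule can be sketched in a few lines of Python. This is an illustration of the policy described above, not the library's actual code; sink_size and window_size (and their default values) are stand-ins chosen for the example.

```python
# Illustrative sketch of the sink + sliding-window eviction rule (not the
# library's implementation). Given the current cache length, return the
# positions whose key/value entries are kept.
def positions_to_keep(cache_len: int, sink_size: int = 4, window_size: int = 1020) -> list[int]:
    if cache_len <= sink_size + window_size:
        return list(range(cache_len))                         # nothing to evict yet
    sinks = list(range(sink_size))                            # always keep the first tokens
    recent = list(range(cache_len - window_size, cache_len))  # most recent tokens
    return sinks + recent

# After 100,000 generated tokens the cache still holds only 4 + 1020 entries,
# which is why VRAM usage stays constant.
assert len(positions_to_keep(100_000)) == 1024
```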

Quick Start & Requirements

  • Install via pip: pip install attention_sinks
  • Requires PyTorch and Hugging Face Transformers.
  • Supports various models including Llama, Mistral, Falcon, MPT, and more.
  • Official documentation and demo scripts are available in the repository; see the usage sketch after this list.
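
The snippet below is a minimal usage sketch. The repository exposes drop-in replacements for the Hugging Face Auto classes; the specific class name AutoModelForCausalLM and the attention_sink_size / attention_sink_window_size keyword arguments used here are assumptions based on the README and should be checked against the repository before use.

```python
import torch
from transformers import AutoTokenizer, TextStreamer
from attention_sinks import AutoModelForCausalLM  # drop-in for the transformers class (assumed name)

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,            # number of initial "sink" tokens to keep (assumed kwarg)
    attention_sink_window_size=1020,  # sliding window of recent tokens (assumed kwarg)
)

# Generate far past the model's usual fluency limit; the KV cache stays at
# sink_size + window_size entries, so memory does not grow with length.
inputs = tokenizer("Tell me an endless story about a lighthouse keeper.", return_tensors="pt").to(model.device)
model.generate(**inputs, streamer=TextStreamer(tokenizer), max_new_tokens=10_000)
```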

Highlighted Details

  • Achieves fluent generation for over 10,000 tokens, outperforming standard transformers and simple windowed attention.
  • Maintains stable perplexity even after processing millions of tokens.
  • Processes streams of hundreds of thousands of tokens with a cache limited to the sink tokens and the recent window (content outside that window is not retained; see Limitations & Caveats).
  • Offers a drop-in replacement API (attention_sinks.AutoModel) for seamless integration.

Maintenance & Community

  • Inspired by and adapted from StreamingLLM.
  • Contributions from the community have extended model support.
  • A changelog is available for release information.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The context window size remains fixed by the model's original pre-training; this method does not extend the effective context window or add long-term memory. Tokens evicted from the cache are gone for good: the model cannot recall or summarize information that falls outside the retained sink tokens and recent window.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (Author of SGLang) and Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX).

xgen by salesforce

0%
720
LLM research release with 8k sequence length
created 2 years ago
updated 6 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

0.1%
3k
LongLoRA: Efficient fine-tuning for long-context LLMs
created 1 year ago
updated 11 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

streaming-llm by mit-han-lab

0.1%
7k
Framework for efficient LLM streaming
created 1 year ago
updated 1 year ago