attention_sinks by tomaarsen

LLM extension for fluent, endless text generation

created 1 year ago
702 stars

Top 49.6% on sourcepulse

View on GitHub
Project Summary

This repository provides an open-source implementation of "Attention Sinks," a technique to enable Large Language Models (LLMs) to generate fluent text indefinitely without retraining, using constant memory. It's ideal for multi-step LLM applications like chat assistants, maintaining performance and fluency over extended generation lengths.

How It Works

Attention Sinks modifies the model's key-value cache so that it always keeps a small, fixed number of initial "sink" tokens together with a sliding window of the most recent tokens, evicting everything in between. The initial tokens receive a disproportionate share of attention, so retaining them preserves the fluency that a plain sliding window loses once they are evicted, while the bounded cache avoids the ever-growing memory use of standard full attention. The result is fluent, long-form generation with constant VRAM usage.
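
The eviction rule can be sketched in a few lines of Python. This is an illustration of the policy described above, not the library's actual code; sink_size and window_size (and their default values) are stand-ins chosen for the example.

```python
# Illustrative sketch of the sink + sliding-window eviction rule (not the
# library's implementation). Given the current cache length, return the
# positions whose key/value entries are kept.
def positions_to_keep(cache_len: int, sink_size: int = 4, window_size: int = 1020) -> list[int]:
    if cache_len <= sink_size + window_size:
        return list(range(cache_len))                         # nothing to evict yet
    sinks = list(range(sink_size))                            # always keep the first tokens
    recent = list(range(cache_len - window_size, cache_len))  # most recent tokens
    return sinks + recent

# After 100,000 generated tokens the cache still holds only 4 + 1020 entries,
# which is why VRAM usage stays constant.
assert len(positions_to_keep(100_000)) == 1024
```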

Quick Start & Requirements

  • Install via pip: pip install attention_sinks
  • Requires PyTorch and Hugging Face Transformers.
  • Supports various models including Llama, Mistral, Falcon, MPT, and more.
  • Official documentation and demo scripts are available in the repository; see the usage sketch after this list.
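
The snippet below is a minimal usage sketch. The repository exposes drop-in replacements for the Hugging Face Auto classes; the specific class name AutoModelForCausalLM and the attention_sink_size / attention_sink_window_size keyword arguments used here are assumptions based on the README and should be checked against the repository before use.

```python
import torch
from transformers import AutoTokenizer, TextStreamer
from attention_sinks import AutoModelForCausalLM  # drop-in for the transformers class (assumed name)

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    attention_sink_size=4,            # number of initial "sink" tokens to keep (assumed kwarg)
    attention_sink_window_size=1020,  # sliding window of recent tokens (assumed kwarg)
)

# Generate far past the model's usual fluency limit; the KV cache stays at
# sink_size + window_size entries, so memory does not grow with length.
inputs = tokenizer("Tell me an endless story about a lighthouse keeper.", return_tensors="pt").to(model.device)
model.generate(**inputs, streamer=TextStreamer(tokenizer), max_new_tokens=10_000)
```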

Highlighted Details

  • Achieves fluent generation for over 10,000 tokens, outperforming standard transformers and simple windowed attention.
  • Maintains stable perplexity even after processing millions of tokens.
  • Processes streams of hundreds of thousands of tokens with a cache limited to the sink tokens and the recent window (content outside that window is not retained; see Limitations & Caveats).
  • Offers a drop-in replacement API (attention_sinks.AutoModel) for seamless integration.

Maintenance & Community

  • Inspired by and adapted from StreamingLLM.
  • Contributions from the community have extended model support.
  • A changelog is available for release information.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README. Users should verify licensing for commercial or closed-source use.

Limitations & Caveats

The context window size remains fixed by the model's original pre-training; this method does not extend the effective context window or add long-term memory. Tokens evicted from the cache are gone for good: the model cannot recall or summarize information that falls outside the retained sink tokens and recent window.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 7 stars in the last 90 days

Explore Similar Projects

Starred by Ying Sheng (Author of SGLang) and Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX).

xgen by salesforce

0%
720
LLM research release with 8k sequence length
created 2 years ago
updated 6 months ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

0.1%
3k
LongLoRA: Efficient fine-tuning for long-context LLMs
created 1 year ago
updated 11 months ago
Starred by Georgios Konstantopoulos (CTO, General Partner at Paradigm), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

streaming-llm by mit-han-lab

0.1%
7k
Framework for efficient LLM streaming
created 1 year ago
updated 1 year ago