AutoCompressors by princeton-nlp

Research code for adapting LMs to compress long contexts

created 2 years ago
309 stars

Top 88.0% on sourcepulse

View on GitHub
Project Summary

This repository provides the official implementation for "Adapting Language Models to Compress Long Contexts," enabling language models to compress extensive context into summary vectors and reason over them. It targets researchers and practitioners working with long-context NLP tasks, offering a method to overcome context length limitations in transformer models.

How It Works

The core innovation is the "AutoCompressor" architecture, which integrates a context compression mechanism directly into the language model. The model is trained to emit a fixed number of "summary vectors" for each segment of the input context. These summary vectors are then prepended to subsequent segments as soft prompts, allowing the model to retain and reason over information from much longer contexts than its native architecture would typically support. Because each segment attends only to its own tokens plus a small set of summary vectors, this sidesteps the quadratic cost of attending over the full long sequence at once.
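To make the data flow concrete, here is a minimal conceptual sketch using plain Hugging Face transformers. It is not the repository's API: the checkpoint, the randomly initialized summary-token embeddings, and the segment texts are placeholders, and a stock model has not been trained to produce useful summary vectors, so only the shapes and wiring are meaningful.

```python
# Conceptual illustration of summary-vector compression (not the repo's API).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder model, small enough to run on CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
embed = model.get_input_embeddings()

num_summary = 4  # AutoCompressors use a larger fixed number per segment (e.g. 50)
hidden = model.config.hidden_size
# Learned summary-token embeddings; randomly initialized here for illustration only.
summary_token_embeds = torch.randn(1, num_summary, hidden) * 0.02

segments = ["First chunk of a long document ...", "Second chunk of the document ..."]
accumulated = []  # summary vectors from all previous segments ("summary accumulation")

for text in segments:
    ids = tokenizer(text, return_tensors="pt").input_ids
    seg_embeds = embed(ids)
    # Soft prompt = summary vectors of earlier segments, prepended to this segment,
    # followed by the summary tokens whose outputs will summarize the segment.
    parts = accumulated + [seg_embeds, summary_token_embeds]
    inputs_embeds = torch.cat(parts, dim=1)
    with torch.no_grad():
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    # Final-layer hidden states at the summary-token positions become new summary vectors.
    accumulated.append(out.hidden_states[-1][:, -num_summary:, :])

soft_prompt = torch.cat(accumulated, dim=1)
print(soft_prompt.shape)  # (1, num_segments * num_summary, hidden): fixed-size memory of the document
```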

Quick Start & Requirements

  • Install: pip install packaging transformers==4.34.0 datasets==2.13.4 accelerate==0.24.1 sentencepiece==0.1.99 flash-attn==2.3.5 wandb, then pip install git+https://github.com/Dao-AILab/flash-attention.git#subdirectory=csrc/rotary for the rotary-embedding kernels.
  • Prerequisites: PyTorch 2.1.0+, CUDA 11.8+ (for flash-attn), bfloat16 support, and a CUDA_HOME that points to the toolkit flash-attn was built against.
  • Resource Footprint: Requires a GPU with enough VRAM for Llama-2-7b in bfloat16 (roughly 24GB+).
  • Links: Hugging Face Models, Paper
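The released checkpoints on Hugging Face are meant to be loaded through classes defined in this repo rather than the vanilla AutoModel classes. Below is a minimal usage sketch, assuming the repo's auto_compressor module exposes a LlamaAutoCompressorModel class and output_softprompt / softprompt arguments as in its README, and that princeton-nlp/AutoCompressor-Llama-2-7b-6k is one of the released checkpoints; verify all of these names against the repository before use.

```python
# Hedged sketch of loading and using a released AutoCompressor checkpoint.
# ASSUMPTIONS: the module `auto_compressor`, the class `LlamaAutoCompressorModel`,
# the checkpoint id, and the `output_softprompt`/`softprompt` arguments mirror the
# repo's README; check the repo for the exact interface.
import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel  # provided by this repository

ckpt = "princeton-nlp/AutoCompressor-Llama-2-7b-6k"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = LlamaAutoCompressorModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16).eval().cuda()

long_context = "Full text of a long document goes here ..."  # placeholder
question = "Question about the document goes here ..."        # placeholder

# Compress the long context into a fixed number of summary vectors.
context_ids = tokenizer(long_context, return_tensors="pt").input_ids.cuda()
summary_vectors = model(context_ids, output_softprompt=True).softprompt

# Reuse the summary vectors as a soft prompt when answering a question about the context.
prompt_ids = tokenizer(question, return_tensors="pt").input_ids.cuda()
answer_ids = model.generate(prompt_ids, softprompt=summary_vectors,
                            max_new_tokens=32, do_sample=False)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```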

Highlighted Details

  • Offers pre-trained AutoCompressors based on Llama-2-7b (trained on 6k-token sequences) and OPT-2.7b/1.3b (context lengths up to 30k tokens).
  • Utilizes Flash Attention for reduced memory requirements during training and inference.
  • Supports both explicit generation of summary vectors and implicit multi-step compression for extremely long inputs (see the sketch after this list).
  • The paper reports that summary vectors retain context information effectively, improving long-range language modeling and in-context learning compared to baselines without access to the compressed context.
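For inputs longer than a single compression window, one option is to compress the document chunk by chunk and concatenate the resulting summary vectors into a single soft prompt. The sketch below is a simplified illustration built on the same assumed interface as the loading sketch above (the trained model also conditions each segment on earlier summaries; the repo documents the exact multi-step interface).

```python
# Simplified multi-step compression sketch (same assumed interface as above).
import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel  # provided by this repository

ckpt = "princeton-nlp/AutoCompressor-Llama-2-7b-6k"   # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = LlamaAutoCompressorModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16).eval().cuda()

very_long_document = "Tens of thousands of tokens of text ..."  # placeholder
ids = tokenizer(very_long_document, return_tensors="pt").input_ids.cuda()

chunk_len = 2048
summaries = []
for start in range(0, ids.size(1), chunk_len):
    chunk = ids[:, start:start + chunk_len]
    # Each chunk is compressed to a fixed number of summary vectors.
    summaries.append(model(chunk, output_softprompt=True).softprompt)

# The concatenated summaries act as one soft prompt covering the whole document.
soft_prompt = torch.cat(summaries, dim=1)
prompt_ids = tokenizer("Summarize the document:", return_tensors="pt").input_ids.cuda()
output = model.generate(prompt_ids, softprompt=soft_prompt, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```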

Maintenance & Community

The project is associated with Princeton University NLP research. For questions or bugs, users can contact the authors via email or open an issue on GitHub.

Licensing & Compatibility

The repository code is likely under a permissive license (e.g., MIT, Apache 2.0), but the underlying base models (Llama-2, OPT) carry their own licenses: Llama-2's license restricts commercial use by companies above a large monthly-active-user threshold (roughly 700 million MAU). Use in closed-source or commercial products therefore depends on the base-model licenses.

Limitations & Caveats

Flash Attention requires specific CUDA versions and hardware, and evaluating with use_cache=True may be unstable. The project pins specific library versions (e.g., transformers 4.34.0), which can conflict with newer releases.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

6 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 1 more.

yarn by jquesnelle

Top 1.0% · 2k stars
Context window extension method for LLMs (research paper, models)
created 2 years ago · updated 1 year ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

Top 0.1% · 3k stars
LongLoRA: Efficient fine-tuning for long-context LLMs
created 1 year ago · updated 11 months ago