AutoCompressors by princeton-nlp

Research code for adapting LMs to compress long contexts

created 2 years ago
309 stars

Top 88.0% on sourcepulse

View on GitHub
Project Summary

This repository provides the official implementation for "Adapting Language Models to Compress Long Contexts," enabling language models to compress extensive context into summary vectors and reason over them. It targets researchers and practitioners working with long-context NLP tasks, offering a method to overcome context length limitations in transformer models.

How It Works

The core innovation is the "AutoCompressor" architecture, which integrates a context compression mechanism directly into the language model. The model is trained to emit a fixed number of "summary vectors" for each segment of the input context. These summary vectors are then prepended to subsequent segments as soft prompts, allowing the model to retain and reason over information from much longer contexts than its native architecture would typically support. Because each segment attends only to its own tokens plus a small set of summary vectors, this sidesteps the quadratic cost of attending over the full long sequence at once.
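To make the data flow concrete, here is a minimal conceptual sketch using plain Hugging Face transformers. It is not the repository's API: the checkpoint, the randomly initialized summary-token embeddings, and the segment texts are placeholders, and a stock model has not been trained to produce useful summary vectors, so only the shapes and wiring are meaningful.

```python
# Conceptual illustration of summary-vector compression (not the repo's API).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder model, small enough to run on CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
embed = model.get_input_embeddings()

num_summary = 4  # AutoCompressors use a larger fixed number per segment (e.g. 50)
hidden = model.config.hidden_size
# Learned summary-token embeddings; randomly initialized here for illustration only.
summary_token_embeds = torch.randn(1, num_summary, hidden) * 0.02

segments = ["First chunk of a long document ...", "Second chunk of the document ..."]
accumulated = []  # summary vectors from all previous segments ("summary accumulation")

for text in segments:
    ids = tokenizer(text, return_tensors="pt").input_ids
    seg_embeds = embed(ids)
    # Soft prompt = summary vectors of earlier segments, prepended to this segment,
    # followed by the summary tokens whose outputs will summarize the segment.
    parts = accumulated + [seg_embeds, summary_token_embeds]
    inputs_embeds = torch.cat(parts, dim=1)
    with torch.no_grad():
        out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    # Final-layer hidden states at the summary-token positions become new summary vectors.
    accumulated.append(out.hidden_states[-1][:, -num_summary:, :])

soft_prompt = torch.cat(accumulated, dim=1)
print(soft_prompt.shape)  # (1, num_segments * num_summary, hidden): fixed-size memory of the document
```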

Quick Start & Requirements

  • Install: pip install packaging transformers==4.34.0 datasets==2.13.4 accelerate==0.24.1 sentencepiece==0.1.99 flash-attn==2.3.5 wandb, then pip install git+https://github.com/Dao-AILab/flash-attention.git#subdirectory=csrc/rotary for the rotary-embedding kernels.
  • Prerequisites: PyTorch 2.1.0+, CUDA 11.8+ (for flash-attn), bfloat16 support, and a CUDA_HOME that points to the toolkit flash-attn was built against.
  • Resource Footprint: Requires a GPU with enough VRAM for Llama-2-7b in bfloat16 (roughly 24GB+).
  • Links: Hugging Face Models, Paper
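The released checkpoints on Hugging Face are meant to be loaded through classes defined in this repo rather than the vanilla AutoModel classes. Below is a minimal usage sketch, assuming the repo's auto_compressor module exposes a LlamaAutoCompressorModel class and output_softprompt / softprompt arguments as in its README, and that princeton-nlp/AutoCompressor-Llama-2-7b-6k is one of the released checkpoints; verify all of these names against the repository before use.

```python
# Hedged sketch of loading and using a released AutoCompressor checkpoint.
# ASSUMPTIONS: the module `auto_compressor`, the class `LlamaAutoCompressorModel`,
# the checkpoint id, and the `output_softprompt`/`softprompt` arguments mirror the
# repo's README; check the repo for the exact interface.
import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel  # provided by this repository

ckpt = "princeton-nlp/AutoCompressor-Llama-2-7b-6k"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = LlamaAutoCompressorModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16).eval().cuda()

long_context = "Full text of a long document goes here ..."  # placeholder
question = "Question about the document goes here ..."        # placeholder

# Compress the long context into a fixed number of summary vectors.
context_ids = tokenizer(long_context, return_tensors="pt").input_ids.cuda()
summary_vectors = model(context_ids, output_softprompt=True).softprompt

# Reuse the summary vectors as a soft prompt when answering a question about the context.
prompt_ids = tokenizer(question, return_tensors="pt").input_ids.cuda()
answer_ids = model.generate(prompt_ids, softprompt=summary_vectors,
                            max_new_tokens=32, do_sample=False)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```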

Highlighted Details

  • Offers pre-trained AutoCompressors based on Llama-2-7b (trained on 6k-token sequences) and OPT-2.7b/1.3b (context lengths up to 30k tokens).
  • Utilizes Flash Attention for reduced memory requirements during training and inference.
  • Supports both explicit generation of summary vectors and implicit multi-step compression for extremely long inputs (see the sketch after this list).
  • The paper reports that summary vectors retain context information effectively, improving long-range language modeling and in-context learning compared to baselines without access to the compressed context.
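For inputs longer than a single compression window, one option is to compress the document chunk by chunk and concatenate the resulting summary vectors into a single soft prompt. The sketch below is a simplified illustration built on the same assumed interface as the loading sketch above (the trained model also conditions each segment on earlier summaries; the repo documents the exact multi-step interface).

```python
# Simplified multi-step compression sketch (same assumed interface as above).
import torch
from transformers import AutoTokenizer
from auto_compressor import LlamaAutoCompressorModel  # provided by this repository

ckpt = "princeton-nlp/AutoCompressor-Llama-2-7b-6k"   # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = LlamaAutoCompressorModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16).eval().cuda()

very_long_document = "Tens of thousands of tokens of text ..."  # placeholder
ids = tokenizer(very_long_document, return_tensors="pt").input_ids.cuda()

chunk_len = 2048
summaries = []
for start in range(0, ids.size(1), chunk_len):
    chunk = ids[:, start:start + chunk_len]
    # Each chunk is compressed to a fixed number of summary vectors.
    summaries.append(model(chunk, output_softprompt=True).softprompt)

# The concatenated summaries act as one soft prompt covering the whole document.
soft_prompt = torch.cat(summaries, dim=1)
prompt_ids = tokenizer("Summarize the document:", return_tensors="pt").input_ids.cuda()
output = model.generate(prompt_ids, softprompt=soft_prompt, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```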

Maintenance & Community

The project is associated with Princeton University NLP research. For questions or bugs, users can contact the authors via email or open an issue on GitHub.

Licensing & Compatibility

The repository code is likely under a permissive license (e.g., MIT, Apache 2.0), but the underlying base models (Llama-2, OPT) carry their own licenses: Llama-2's license restricts commercial use by companies above a large monthly-active-user threshold (roughly 700 million MAU). Use in closed-source or commercial products therefore depends on the base-model licenses.

Limitations & Caveats

Flash Attention requires specific CUDA versions and hardware, and evaluating with use_cache=True may be unstable. The project pins specific library versions (e.g., transformers 4.34.0), which can conflict with newer releases.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

6 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (cofounder of Cloudera), and 1 more.

yarn by jquesnelle

Top 1.0% · 2k stars
Context window extension method for LLMs (research paper, models)
created 2 years ago · updated 1 year ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

Top 0.1% · 3k stars
LongLoRA: Efficient fine-tuning for long-context LLMs
created 1 year ago · updated 11 months ago