StreamingLLM: a framework for efficient LLM streaming
This project provides an efficient framework for deploying Large Language Models (LLMs) on inputs of effectively unbounded length, addressing the memory-consumption and length-generalization challenges of streaming deployment. It is aimed at researchers and developers building streaming applications such as multi-round dialogue, enabling LLMs to handle continuous interaction without performance degradation or unbounded memory growth.
How It Works
StreamingLLM enables LLMs trained with a finite context window to generalize to effectively infinite sequence lengths without fine-tuning. It leverages the "attention sink" phenomenon: the first few tokens attract a disproportionate share of attention scores, so keeping them in the cache preserves model quality. The framework therefore retains the Key and Value (KV) states of these initial attention-sink tokens plus a rolling window of the most recent tokens, discarding everything in between, and achieves up to a 22.2x speedup over the sliding-window-with-recomputation baseline.
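A minimal sketch of this eviction policy in PyTorch is shown below; the function name, tensor layout, and default sizes are illustrative rather than the project's actual API, and the real implementation additionally re-maps rotary position ids within the retained cache, which this sketch omits.

```python
import torch

def evict_kv(past_key_values, start_size=4, recent_size=2044):
    """Keep the KV entries of the first `start_size` attention-sink tokens
    and the last `recent_size` tokens; drop everything in between.

    `past_key_values` is a sequence of (key, value) pairs, one per layer,
    each tensor shaped [batch, num_heads, seq_len, head_dim].
    """
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache is still small enough; nothing to evict
    return [
        (
            torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2),
        )
        for k, v in past_key_values
    ]

# Toy check: a single-layer cache holding 3000 tokens shrinks to 4 + 2044 = 2048.
k = torch.randn(1, 8, 3000, 64)
v = torch.randn(1, 8, 3000, 64)
print(evict_kv([(k, v)])[0][0].shape)  # torch.Size([1, 8, 2048, 64])
```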
Quick Start & Requirements
```bash
conda create -yn streaming python=3.8
conda activate streaming

pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
python setup.py develop

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py
```
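For orientation, the sketch below shows where an eviction step like the one above could sit in a manual greedy-decoding loop with Hugging Face Transformers; the model name, cache sizes, and loop structure are illustrative and are not the project's run_streaming_llama.py script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM with a KV cache works for this sketch.
name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

input_ids = tok("Hello, how are you?", return_tensors="pt").input_ids.to(model.device)
past, generated = None, []
for _ in range(200):  # stream 200 new tokens
    out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
    # Trim the cache to attention sinks + recent window (evict_kv from the sketch above).
    past = evict_kv(out.past_key_values, start_size=4, recent_size=2000)
    input_ids = out.logits[:, -1:].argmax(dim=-1)  # greedy next token
    generated.append(input_ids.item())
print(tok.decode(generated))
```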
Maintenance & Community
The project is developed by MIT Han Lab. Integrations from HPC-AI Tech, NVIDIA TensorRT-LLM, CMU, UW, OctoAI, and Hugging Face Transformers indicate active adoption and community interest.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The context window is not enlarged; only the attention-sink tokens and a recent window are retained, so the model cannot recall information once it has been evicted from the cache, and its effective context stays bounded by the original pretraining window. For summarization over very long inputs, it may therefore attend only to the concluding portion of the text.
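A rough numerical illustration, using example sizes rather than the project's defaults:

```python
# With 4 attention-sink tokens and a 2044-token recent window, the cache never
# exceeds 4 + 2044 = 2048 entries. While generating token 10,000, only the
# sinks (positions 0-3) and the most recent window (positions 7,956-9,999)
# remain attendable; everything in between has been permanently evicted.
start_size, recent_size, current_pos = 4, 2044, 10_000
visible = list(range(start_size)) + list(range(current_pos - recent_size, current_pos))
print(len(visible), visible[:5], visible[-1])  # 2048 [0, 1, 2, 3, 7956] 9999
```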