streaming-llm by mit-han-lab

Framework for efficient LLM streaming

Created 2 years ago
7,042 stars

Top 7.3% on SourcePulse

Project Summary

This project provides an efficient framework for deploying Large Language Models (LLMs) on infinite-length inputs, addressing memory consumption and length-generalization challenges. It is designed for researchers and developers building streaming applications such as multi-round dialogue, enabling LLMs to handle continuous interactions without performance degradation or excessive memory usage.

How It Works

StreamingLLM enables LLMs trained with finite context windows to generalize to infinite sequence lengths without fine-tuning. It leverages the "attention sink" phenomenon, in which the initial tokens absorb a disproportionate share of attention scores, so keeping them preserves performance. The framework caches the Key and Value (KV) states of these initial attention-sink tokens plus a rolling window of recent tokens, discarding everything in between, and achieves up to a 22.2x speedup over sliding-window recomputation.
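
A minimal sketch of that eviction policy (not the repository's actual code) is shown below; the tuple-of-tensors cache layout, tensor shapes, and the `start_size`/`recent_size` names are assumptions for illustration.

```python
import torch

def evict_kv(past_key_values, start_size=4, recent_size=2000):
    """Keep the first `start_size` tokens (attention sinks) and the last
    `recent_size` tokens along the sequence axis of each layer's KV cache,
    dropping everything in between. Assumes each entry is a (key, value)
    pair of tensors shaped [batch, num_heads, seq_len, head_dim]."""
    new_cache = []
    for key, value in past_key_values:
        seq_len = key.size(2)
        if seq_len <= start_size + recent_size:
            new_cache.append((key, value))  # nothing to evict yet
            continue

        def keep(t):
            # Concatenate the sink tokens with the most recent window.
            return torch.cat([t[:, :, :start_size], t[:, :, -recent_size:]], dim=2)

        new_cache.append((keep(key), keep(value)))
    return new_cache
```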

Quick Start & Requirements

  • Install: conda create -yn streaming python=3.8, conda activate streaming, pip install torch torchvision torchaudio, pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece, then python setup.py develop
  • Prerequisites: Python 3.8, PyTorch, Hugging Face Transformers (v4.33.0), CUDA.
  • Run: CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py (see the usage sketch after this list)
  • Links: Paper, Slides, Video
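
A minimal Python usage sketch, assuming the repository's `enable_streaming_llm` helper and its `(model, start_size, recent_size)` signature; the import path and checkpoint name are illustrative and should be checked against examples/run_streaming_llama.py.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed import path; the repo exposes an `enable_streaming_llm` helper.
from streaming_llm.enable_streaming_llm import enable_streaming_llm

model_name = "lmsys/vicuna-13b-v1.3"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Keep 4 attention-sink tokens plus the most recent 2000 tokens in the KV cache;
# the returned kv_cache object is used to evict entries during generation.
kv_cache = enable_streaming_llm(model, start_size=4, recent_size=2000)
```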

Highlighted Details

  • Enables pre-trained LLMs (Llama-2, MPT, Falcon, Pythia) to handle sequences of up to 4 million tokens.
  • Achieves up to a 22.2x speedup over the sliding-window-with-recomputation baseline.
  • Integrated into NVIDIA TensorRT-LLM and Hugging Face Transformers (see the sketch after this list).
  • Pre-training with a dedicated placeholder token that serves as the attention sink further improves streaming deployment.
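
For the Hugging Face Transformers integration, a hedged sketch follows; it assumes a transformers release that ships the SinkCache class (introduced around v4.36 and deprecated in later releases), so check your installed version before relying on this API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

model_name = "meta-llama/Llama-2-7b-chat-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Tell me a very long story.", return_tensors="pt").to(model.device)

# SinkCache keeps `num_sink_tokens` attention sinks plus a window of recent tokens
# (bounded by `window_length`), mirroring StreamingLLM's eviction policy.
cache = SinkCache(window_length=1024, num_sink_tokens=4)
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```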

Maintenance & Community

The project is associated with MIT Han Lab. Integrations by HPC-AI Tech, NVIDIA TensorRT-LLM, CMU, UW, OctoAI, and Hugging Face Transformers indicate active adoption and community interest.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The context window size remains unchanged; only the attention-sink tokens and a window of recent tokens are kept in the cache, so the model cannot recall information that has already been evicted. For summarization tasks with very long inputs, it may therefore only attend to the concluding parts.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 27 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Ying Sheng (coauthor of SGLang), and 2 more.

LookaheadDecoding by hao-ai-lab

0.2%
1k
Parallel decoding algorithm for faster LLM inference
Created 1 year ago
Updated 6 months ago
Starred by Lianmin Zheng (coauthor of SGLang and vLLM), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

0.4%
8k
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago
Updated 1 week ago