StreamingLLM: a framework for efficient LLM streaming
This project provides an efficient framework for deploying Large Language Models (LLMs) on inputs of effectively unbounded length, addressing the memory-consumption and length-generalization challenges of streaming deployment. It is aimed at researchers and developers building streaming applications such as multi-round dialogue, enabling LLMs to handle continuous interaction without performance degradation or unbounded memory growth.
How It Works
StreamingLLM enables LLMs trained with a finite context window to generalize to effectively infinite sequence lengths without fine-tuning. It leverages the "attention sink" phenomenon: the first few tokens attract a disproportionate share of attention scores, so keeping them in the cache preserves model quality. The framework therefore retains the Key and Value (KV) states of these initial attention-sink tokens plus a rolling window of the most recent tokens, discarding everything in between, and achieves up to a 22.2x speedup over the sliding-window-with-recomputation baseline.
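A minimal sketch of this eviction policy in PyTorch is shown below; the function name, tensor layout, and default sizes are illustrative rather than the project's actual API, and the real implementation additionally re-maps rotary position ids within the retained cache, which this sketch omits.

```python
import torch

def evict_kv(past_key_values, start_size=4, recent_size=2044):
    """Keep the KV entries of the first `start_size` attention-sink tokens
    and the last `recent_size` tokens; drop everything in between.

    `past_key_values` is a sequence of (key, value) pairs, one per layer,
    each tensor shaped [batch, num_heads, seq_len, head_dim].
    """
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= start_size + recent_size:
        return past_key_values  # cache is still small enough; nothing to evict
    return [
        (
            torch.cat([k[:, :, :start_size], k[:, :, -recent_size:]], dim=2),
            torch.cat([v[:, :, :start_size], v[:, :, -recent_size:]], dim=2),
        )
        for k, v in past_key_values
    ]

# Toy check: a single-layer cache holding 3000 tokens shrinks to 4 + 2044 = 2048.
k = torch.randn(1, 8, 3000, 64)
v = torch.randn(1, 8, 3000, 64)
print(evict_kv([(k, v)])[0][0].shape)  # torch.Size([1, 8, 2048, 64])
```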
Quick Start & Requirements
```bash
conda create -yn streaming python=3.8
conda activate streaming

pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
python setup.py develop

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py
```
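For orientation, the sketch below shows where an eviction step like the one above could sit in a manual greedy-decoding loop with Hugging Face Transformers; the model name, cache sizes, and loop structure are illustrative and are not the project's run_streaming_llama.py script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM with a KV cache works for this sketch.
name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

input_ids = tok("Hello, how are you?", return_tensors="pt").input_ids.to(model.device)
past, generated = None, []
for _ in range(200):  # stream 200 new tokens
    out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
    # Trim the cache to attention sinks + recent window (evict_kv from the sketch above).
    past = evict_kv(out.past_key_values, start_size=4, recent_size=2000)
    input_ids = out.logits[:, -1:].argmax(dim=-1)  # greedy next token
    generated.append(input_ids.item())
print(tok.decode(generated))
```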
Maintenance & Community
The project is developed by MIT Han Lab. Integrations from HPC-AI Tech, NVIDIA TensorRT-LLM, CMU, UW, OctoAI, and Hugging Face Transformers indicate active adoption and community interest.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The context window is not enlarged; only the attention-sink tokens and a recent window are retained, so the model cannot recall information once it has been evicted from the cache, and its effective context stays bounded by the original pretraining window. For summarization over very long inputs, it may therefore attend only to the concluding portion of the text.
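A rough numerical illustration, using example sizes rather than the project's defaults:

```python
# With 4 attention-sink tokens and a 2044-token recent window, the cache never
# exceeds 4 + 2044 = 2048 entries. While generating token 10,000, only the
# sinks (positions 0-3) and the most recent window (positions 7,956-9,999)
# remain attendable; everything in between has been permanently evicted.
start_size, recent_size, current_pos = 4, 2044, 10_000
visible = list(range(start_size)) + list(range(current_pos - recent_size, current_pos))
print(len(visible), visible[:5], visible[-1])  # 2048 [0, 1, 2, 3, 7956] 9999
```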