streaming-llm by mit-han-lab

Framework for efficient LLM streaming

created 1 year ago
6,948 stars

Top 7.5% on sourcepulse

View on GitHub
Project Summary

This project provides an efficient framework for deploying Large Language Models (LLMs) on inputs of effectively unbounded length, addressing the memory growth and length-generalization problems that arise in streaming settings. It is aimed at researchers and developers building streaming applications such as multi-round dialogue, where LLMs must handle continuous interaction without performance degradation or excessive memory usage.

How It Works

StreamingLLM enables LLMs trained with finite attention windows to generalize to infinite sequence lengths without fine-tuning. It builds on the "attention sink" phenomenon: the initial tokens absorb a disproportionate share of attention scores, so keeping them preserves performance. The framework therefore caches the Key and Value (KV) states of these attention-sink tokens together with a rolling window of the most recent tokens, discarding everything in between, and achieves up to a 22.2x speedup over the sliding-window-with-recomputation baseline.
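
A rough sketch of that eviction policy in plain PyTorch is shown below; the function name, tensor shapes, and default sizes are assumptions for illustration, not the repository's actual implementation:

    # Keep the first `start_size` "attention sink" tokens plus the `recent_size`
    # most recent tokens; evict everything in between. (Assumed cache layout:
    # [batch, heads, seq_len, head_dim] per layer, as in common decoder caches.)
    import torch

    def evict_middle(key, value, start_size=4, recent_size=2000):
        """Prune one layer's cached key/value tensors along the sequence dimension."""
        if key.shape[2] <= start_size + recent_size:
            return key, value                      # cache is still small enough, keep as-is
        keep = lambda t: torch.cat([t[:, :, :start_size], t[:, :, -recent_size:]], dim=2)
        return keep(key), keep(value)

    # Toy check: 3000 cached positions shrink to 4 sinks + 2000 recent = 2004.
    k = v = torch.zeros(1, 8, 3000, 64)
    print(evict_middle(k, v)[0].shape)             # torch.Size([1, 8, 2004, 64])

Note that the actual method also assigns positional information relative to slots inside the cache rather than to positions in the original text, a detail the naive slicing above leaves out.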

Quick Start & Requirements

  • Install:
      conda create -yn streaming python=3.8
      conda activate streaming
      pip install torch torchvision torchaudio
      pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
      python setup.py develop
  • Prerequisites: Python 3.8, PyTorch, Hugging Face Transformers (v4.33.0), CUDA.
  • Run: CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py
  • Links: Paper, Slides, Video
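
For a sense of how the pieces fit together outside the bundled example script, below is a minimal hand-rolled streaming loop; the checkpoint, toy cache sizes, and the tuple-style past_key_values layout (as in the pinned transformers 4.33) are assumptions for illustration, not the project's own code:

    # Hypothetical streaming generation with a sink + recent-window KV cache.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/pythia-160m"          # small open checkpoint, chosen only for the demo
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    START, RECENT = 4, 64                    # toy sizes so eviction actually triggers below

    def evict(past):
        # Keep sink + recent KV entries for every layer (tuple-style cache).
        pruned = []
        for k, v in past:
            if k.shape[2] > START + RECENT:
                k = torch.cat([k[:, :, :START], k[:, :, -RECENT:]], dim=2)
                v = torch.cat([v[:, :, :START], v[:, :, -RECENT:]], dim=2)
            pruned.append((k, v))
        return tuple(pruned)

    ids = tok("The attention sink trick works because", return_tensors="pt").input_ids
    past = None
    with torch.no_grad():
        for _ in range(200):                 # stream well past the retained window
            outputs = model(input_ids=ids, past_key_values=past, use_cache=True)
            past = evict(outputs.past_key_values)
            ids = outputs.logits[:, -1:].argmax(dim=-1)   # greedy next token
            print(tok.decode(ids[0]), end="", flush=True)

With such a tiny window the generations are not meant to be good; the point is only that memory stays bounded while decoding continues indefinitely.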

Highlighted Details

  • Enables LLMs (Llama-2, MPT, Falcon, Pythia) to handle up to 4 million tokens.
  • Achieves up to a 22.2x speedup over the sliding-window-with-recomputation baseline.
  • Integrated into NVIDIA TensorRT-LLM and Hugging Face Transformers.
  • Pre-training with a dedicated placeholder token as an attention sink further improves streaming deployment.

Maintenance & Community

The project is associated with MIT Han Lab. Integrations by HPC-AI Tech, NVIDIA TensorRT-LLM, CMU, UW, OctoAI, and Hugging Face Transformers indicate active adoption and community interest.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

StreamingLLM does not enlarge the model's context window or give it long-term memory: only the attention-sink tokens and a rolling window of recent tokens are retained, so the model cannot recall information that has been evicted from the cache, nor attend to more tokens at once than its pre-training window allows. For summarization of very long inputs, the output may therefore reflect only the concluding parts.
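
As a back-of-the-envelope illustration (cache sizes here are assumed, not the project's defaults), whatever falls outside the attention sinks and the recent window is simply gone, no matter how much text has streamed past:

    # Hypothetical sizes: 4 sink tokens + a 2044-token recent window.
    def retained_positions(n_seen, start_size=4, recent_size=2044):
        """Token positions whose KV states are still cached after n_seen tokens."""
        if n_seen <= start_size + recent_size:
            return list(range(n_seen))
        return list(range(start_size)) + list(range(n_seen - recent_size, n_seen))

    print(len(retained_positions(4_000_000)))   # 2048 -> attention span stays fixed
    print(retained_positions(12, 2, 6))         # [0, 1, 6, 7, 8, 9, 10, 11]; positions 2-5 are gone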

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 107 stars in the last 90 days

Explore Similar Projects

xgen by salesforce
  • 720 stars · 0% · created 2 years ago · updated 6 months ago
  • LLM research release with 8k sequence length
  • Starred by Ying Sheng (Author of SGLang) and Jared Palmer (Ex-VP of AI at Vercel; Founder of Turborepo; Author of Formik, TSDX)

TinyLlama by jzhang38
  • 9k stars · 0.3% · created 1 year ago · updated 1 year ago
  • Tiny pretraining project for a 1.1B Llama model
  • Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; Author of CS 231n), George Hotz (Author of tinygrad; Founder of the tiny corp, comma.ai), and 10 more