Training-free method for extending LLM context windows
This repository provides ChunkLlama, a training-free method for extending the context window of Large Language Models (LLMs) by over 8x. It targets researchers and practitioners seeking to improve LLM performance on long-context tasks without costly retraining. ChunkLlama integrates seamlessly with existing inference libraries and positional encoding methods, enabling significant context scaling for models like Llama-2/3 and Mistral.
How It Works
ChunkLlama implements a dual chunk attention (DCA) mechanism. DCA splits the attention computation over a long sequence into chunk-wise components (attention within a chunk and attention across chunks), remapping relative positions so they stay within the range seen during pretraining; this lets the model process sequences far longer than its original pretraining length. Because no additional training is required, it is a highly efficient way to gain long-context capability. DCA is compatible with popular extrapolation techniques such as Positional Interpolation (PI) and NTK-Aware RoPE, and with memory-efficient inference libraries such as FlashAttention and vLLM.
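To make the idea concrete, here is a toy sketch (a simplified illustration, not the repository's implementation) of how relative position indices can be kept inside the pretraining window: exact distances are used within a chunk, and cross-chunk distances are clamped.

```python
import numpy as np

def chunked_relative_positions(seq_len: int, chunk_size: int) -> np.ndarray:
    """Relative query-key distances for causal attention where within-chunk
    pairs keep their true distance and cross-chunk pairs are clamped to
    chunk_size - 1, so no index ever exceeds the pretraining range.
    (Simplified: real dual chunk attention also treats the immediately
    preceding chunk specially to preserve locality.)"""
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]                    # true query-key distances
    same_chunk = (pos[:, None] // chunk_size) == (pos[None, :] // chunk_size)
    rel = np.where(same_chunk, dist, np.minimum(dist, chunk_size - 1))
    return np.where(dist >= 0, rel, -1)                   # -1 marks masked future keys

# A 16-token sequence with chunks of 4 positions never yields a relative
# index above 3, no matter how long the sequence grows.
print(chunked_relative_positions(16, 4).max())            # -> 3
```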
Quick Start & Requirements
Install with `pip install -e .` inside the `vllm` directory. Dependencies: `transformers` and `flash-attn` (>= 2.5.0, < 2.6.0). A GPU with sufficient VRAM is recommended for longer contexts (e.g., an 80GB A100 for a 90k-token context with Llama 2 7B). Setup involves modifying `config.json` and integrating the provided Python code snippets for inference.
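A minimal inference sketch is shown below. It assumes the repository exposes a monkey-patch helper, here called `replace_with_chunkllama` in a module `chunkllama_attn_replace`; both names are assumptions, so check the repository for the actual entry point and arguments.

```python
# Minimal sketch, assuming the repo ships a patch module named
# `chunkllama_attn_replace` with a `replace_with_chunkllama` helper
# (check the repository for the actual entry point and arguments).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from chunkllama_attn_replace import replace_with_chunkllama  # assumed name
replace_with_chunkllama(pretraining_length=4096)  # original Llama-2 context length

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn >= 2.5.0, < 2.6.0
    device_map="auto",
)

long_prompt = open("long_document.txt").read() + "\n\nQuestion: Summarize the document."
inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```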
Highlighted Details
Maintenance & Community
The project is associated with HKUNLP and acknowledges contributions from Fei Huang. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The data and weights are licensed for non-commercial, research-only use, which is a significant restriction for commercial applications.
Limitations & Caveats
While 7B models can achieve low perplexity on long inputs, they may still struggle with practical long-context tasks; the larger 13B/70B models are recommended for higher accuracy.