ChunkLlama by HKUNLP

Training-free method for extending LLM context windows

created 1 year ago
431 stars

Top 69.9% on sourcepulse

View on GitHub
Project Summary

This repository provides ChunkLlama, a training-free method for extending the context window of Large Language Models (LLMs) by over 8x. It targets researchers and practitioners seeking to improve LLM performance on long-context tasks without costly retraining. ChunkLlama integrates seamlessly with existing inference libraries and positional encoding methods, enabling significant context scaling for models like Llama-2/3 and Mistral.

How It Works

ChunkLlama implements Dual Chunk Attention (DCA). The long input is split into chunks no longer than the model's pretraining window, and attention is decomposed into within-chunk and cross-chunk components whose position indices are remapped so that every relative distance stays inside the range seen during pretraining. Because this requires no additional training, it is a highly efficient way to obtain long-context capability. It is compatible with popular extrapolation techniques like Positional Interpolation (PI) and NTK-Aware RoPE, and with memory-efficient inference libraries such as FlashAttention and vLLM.
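For intuition, here is a minimal, illustrative sketch (in PyTorch, not the repository's implementation) of the core trick: restart position indices inside each chunk and view earlier chunks from a fixed "window edge" position, so no query-key relative distance ever exceeds the pretrained range. The names chunk_size and pretrain_len are assumptions for the example.

    # Illustrative sketch only -- not the repository's code.
    import torch

    def dual_chunk_position_ids(seq_len: int, chunk_size: int, pretrain_len: int):
        idx = torch.arange(seq_len)
        intra = idx % chunk_size                          # positions restart inside each chunk
        inter_q = torch.full_like(idx, pretrain_len - 1)  # queries "sit" at the window edge
        inter_k = idx % chunk_size                        # keys keep their within-chunk offset
        return intra, inter_q, inter_k

    # 12 tokens, chunks of 4, pretrained window of 8:
    intra, q_pos, k_pos = dual_chunk_position_ids(12, chunk_size=4, pretrain_len=8)
    print(intra.tolist())  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
    print(q_pos.tolist())  # all 7 -> every cross-chunk distance stays within the window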

Quick Start & Requirements

  • Installation: Editable install via pip install -e . within the vllm directory.
  • Prerequisites: Python, transformers, flash-attn (>= 2.5.0, < 2.6.0). A GPU with sufficient VRAM is recommended for longer contexts (e.g., an 80GB A100 for a 90k-token context with Llama-2 7B).
  • Setup: Requires modifying the model's config.json and integrating the provided Python code snippets for inference (a minimal integration sketch follows this list).
  • Links: Official Quick Start, Flash Decoding
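A minimal sketch of that setup, assuming the module and function names shown in the repository's Quick Start (chunkllama_attn_replace, replace_with_chunkllama) and that the config.json change is raising max_position_embeddings; verify both against the current README:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # 1) Apply the dual chunk attention patch BEFORE loading the model (assumed API).
    from chunkllama_attn_replace import replace_with_chunkllama
    replace_with_chunkllama(pretraining_length=4096)  # Llama-2's original 4k window

    # 2) Load the base model as usual; flash-attn >= 2.5.0 should be installed.
    model_name = "meta-llama/Llama-2-7b-chat-hf"      # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # 3) Generate over a context far longer than 4k tokens.
    prompt = "<long document> ... <question about it>"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(out[0], skip_special_tokens=True))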

Highlighted Details

  • Extends Llama-2/3 70B to a 100k-token context length, far beyond their original 4k/8k windows.
  • Supports inference for Qwen-2 and Llama-2/3 with vLLM (see the usage sketch after this list).
  • Demonstrates strong performance on PG19 perplexity and retrieval tasks, with ChunkLlama3-70B matching GPT-4 on some benchmarks.
  • Offers fine-tuning scripts for further performance improvements on long conversations.
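For the vLLM path, a hedged usage sketch: it assumes the patched vLLM from this repository is installed (pip install -e . inside its vllm directory, per the Quick Start), after which inference follows the standard vLLM API. The model name and max_model_len below are illustrative.

    from vllm import LLM, SamplingParams

    # Standard vLLM usage; long-context support comes from the patched install.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model
        max_model_len=65536,        # request a context well beyond the original window
        tensor_parallel_size=1,     # increase for 70B-class models
    )
    params = SamplingParams(temperature=0.0, max_tokens=256)

    long_prompt = "<long document> ... <question>"
    outputs = llm.generate([long_prompt], params)
    print(outputs[0].outputs[0].text)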

Maintenance & Community

The project is associated with HKUNLP and acknowledges contributions from Fei Huang. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

  • Code: Apache License 2.0.
  • Data & Weights: CC-BY-NC 4.0 License, strictly for research and non-commercial use. Models trained using the dataset are restricted to research purposes.

Limitations & Caveats

The data and weights are licensed for non-commercial, research-only use, which is a significant restriction for commercial applications. While 7B models can reach low perplexity, they may still struggle with practical long-context tasks; the authors recommend the larger 13B/70B models for higher accuracy.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 26 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jeff Hammerbacher (Cofounder of Cloudera), and 1 more.

yarn by jquesnelle

1.0%
2k
Context window extension method for LLMs (research paper, models)
created 2 years ago
updated 1 year ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

0.1%
3k
LongLoRA: Efficient fine-tuning for long-context LLMs
created 1 year ago
updated 11 months ago