Recipes for language model context length extrapolation to 1M tokens
This repository provides memory-optimization and training recipes for extending language models' context length to 1 million tokens on minimal hardware. It targets researchers and practitioners who want to demystify and implement long-context capabilities without proprietary infrastructure, demonstrating that a 700K-token context with Llama-2-7B on 8 A100s and a 1M-token context with Llama-2-13B on 16 A100s are achievable with existing techniques.
How It Works
The project combines several established techniques to enable efficient long-context training: Sequence parallelism, DeepSpeed ZeRO-3 offload, Flash Attention and its fused kernels, and activation checkpointing. It supports various sequence parallel methods including Ring Attention, Dist Flash Attention, and DeepSpeed Ulysses. This approach allows for full fine-tuning with full attention and full sequence length, avoiding approximations and demonstrating a straightforward path to scaling context windows.
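A minimal sketch of how these pieces could be wired together with Hugging Face Transformers and DeepSpeed is shown below. It is illustrative only: the model name is an example, the ZeRO-3 offload config is a bare-bones assumption, and apply_seq_parallel_patch is a hypothetical stand-in for the repository's own sequence-parallel utilities.

```python
import torch
from transformers import AutoModelForCausalLM

# DeepSpeed ZeRO-3 with CPU offload for parameters and optimizer states.
# In practice this dict (or a JSON file) is handed to the training loop,
# e.g. via TrainingArguments(deepspeed=...) or deepspeed.initialize(...).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

# Flash Attention kernels via the Transformers integration.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # example model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Activation checkpointing: recompute activations in the backward pass
# instead of storing them all.
model.gradient_checkpointing_enable()

# Sequence parallelism (Ring Attention / Dist Flash Attention / DeepSpeed
# Ulysses) is applied by the repository's own utilities; this call is a
# hypothetical placeholder for that step.
# apply_seq_parallel_patch(model, method="ring_attention")
```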
Quick Start & Requirements
Set up a conda environment with Python 3.10 and a PyTorch nightly build with CUDA 11.8, then install the dependencies:
pip install -r requirements.txt
Flash Attention should be installed with the --no-build-isolation --no-cache-dir pip flags.
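As a quick sanity check (illustrative only, not part of the repository), the snippet below confirms that a PyTorch nightly built against CUDA 11.8 and a working Flash Attention install are present:

```python
import torch

# Expect a nightly (dev) build of PyTorch compiled against CUDA 11.8.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; install it with --no-build-isolation --no-cache-dir")
```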
Highlighted Details
Full fine-tuning with full attention over the full sequence length, with no approximations: 700K-token context with Llama-2-7B on 8 A100s and 1M-token context with Llama-2-13B on 16 A100s.
Maintenance & Community
The project is actively updated; recent additions include DeepSpeed Ulysses support and distractors for the needle-in-a-haystack evaluation. Community contributions and collaborations are welcome via issues and pull requests.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, it acknowledges and builds upon several open-source projects, including Ring-Attention and Flash-Attention. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
The project notes that some "red bricks" remain in needle-in-a-haystack evaluations, suggesting potential improvements through instruction tuning or more extensive long-context training. PyTorch nightly builds are required due to memory issues with stable versions. There is no clear timeline for planned features like instruction tuning or Mistral model support.