jzhang38: Recipes for language model context length extrapolation to 1M tokens
This repository provides memory optimization and training recipes to extend language models' context length to 1 million tokens with minimal hardware. It targets researchers and practitioners aiming to demystify and implement long-context capabilities without requiring proprietary infrastructure. The project demonstrates that achieving 700K context with Llama-2-7B on 8 A100s and 1M with Llama-2-13B on 16 A100s is feasible using existing techniques.
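To put those numbers in perspective, here is a rough back-of-the-envelope calculation (my own illustration, using assumed Llama-2-7B dimensions of 32 layers and hidden size 4096, not taken from the repository) of the memory pressure a single 700K-token full-attention pass creates:

```python
# Rough, assumption-laden arithmetic (not from the repository): size of just
# the key/value activations for one 700K-token sequence on Llama-2-7B in bf16.
seq_len = 700_000
num_layers = 32          # assumed Llama-2-7B depth
hidden_size = 4096       # assumed Llama-2-7B hidden size
bytes_per_value = 2      # bf16
kv_tensors = 2           # one K and one V tensor per layer

total_bytes = seq_len * hidden_size * bytes_per_value * kv_tensors * num_layers
per_gpu_bytes = total_bytes / 8   # sequence sharded across 8 A100s

print(f"K/V activations, full sequence: {total_bytes / 1e9:.0f} GB")   # ~367 GB
print(f"Per GPU with 8-way sequence sharding: {per_gpu_bytes / 1e9:.0f} GB")  # ~46 GB
```

Even this one slice of the activation memory exceeds the 80 GB of a single A100, which is why the recipes stack several memory-reduction techniques rather than relying on any single trick.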
How It Works
The project combines several established techniques to enable efficient long-context training: Sequence parallelism, DeepSpeed ZeRO-3 offload, Flash Attention and its fused kernels, and activation checkpointing. It supports various sequence parallel methods including Ring Attention, Dist Flash Attention, and DeepSpeed Ulysses. This approach allows for full fine-tuning with full attention and full sequence length, avoiding approximations and demonstrating a straightforward path to scaling context windows.
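A minimal sketch of how these pieces fit together, assuming standard PyTorch and DeepSpeed interfaces (this is illustrative, not the repository's actual training code):

```python
# Illustrative sketch, not the repository's training code: how the listed
# memory-saving techniques combine for long-context full fine-tuning.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


def attention_block(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Flash/fused attention: computes softmax(QK^T)V without materializing
    # the full (seq_len x seq_len) attention matrix.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)


def checkpointed_attention(q, k, v):
    # Activation checkpointing: drop intermediate activations in the forward
    # pass and recompute them during backward, trading compute for memory.
    return checkpoint(attention_block, q, k, v, use_reentrant=False)


# DeepSpeed ZeRO-3 with CPU offload shards parameters, gradients, and optimizer
# state across ranks and moves them off the GPU (illustrative config only).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
}

# Sequence parallelism (Ring Attention, Dist Flash Attention, or DeepSpeed
# Ulysses) would additionally split the sequence dimension across GPUs, so each
# rank only holds activations for its local chunk of the 1M-token context.
```

Checkpointing and offloading free enough GPU memory that the remaining budget can be spent on the sequence-sharded activations of the long context.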
Quick Start & Requirements
Set up a conda environment with Python 3.10 and a PyTorch nightly build with CUDA 11.8, then install dependencies with pip install -r requirements.txt. Some packages may need to be installed with the --no-build-isolation --no-cache-dir flags.
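As a quick sanity check of the environment (a hypothetical helper, not part of the repository), something like the following confirms the expected toolchain is present:

```python
# Hypothetical environment check, not part of the repository.
import importlib.util
import torch

print("torch:", torch.__version__)             # expect a nightly build
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)       # expect 11.8

for pkg in ("flash_attn", "deepspeed"):
    present = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if present else 'missing'}")
```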
Maintenance & Community
The project is actively updated; recent additions include DeepSpeed Ulysses support and distractor passages for the needle-in-a-haystack evaluation. Community contributions and collaborations are welcomed via issues and pull requests.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, it acknowledges and builds upon several open-source projects, including Ring-Attention and Flash-Attention. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
The project notes that some "red bricks" (imperfect retrieval scores in the evaluation heatmap) remain in needle-in-a-haystack evaluations, suggesting potential improvements through instruction tuning or more extensive long-context training. PyTorch nightly builds are required due to memory issues with stable versions. There is no clear timeline for planned features such as instruction tuning or Mistral model support.