Recipes for language model context length extrapolation to 1M tokens
This repository provides memory-optimization and training recipes for extending language models' context length to 1 million tokens on minimal hardware. It targets researchers and practitioners who want to demystify and implement long-context capabilities without proprietary infrastructure, demonstrating that a 700K-token context with Llama-2-7B on 8 A100s and a 1M-token context with Llama-2-13B on 16 A100s are achievable with existing techniques.
How It Works
The project combines several established techniques to enable efficient long-context training: Sequence parallelism, DeepSpeed ZeRO-3 offload, Flash Attention and its fused kernels, and activation checkpointing. It supports various sequence parallel methods including Ring Attention, Dist Flash Attention, and DeepSpeed Ulysses. This approach allows for full fine-tuning with full attention and full sequence length, avoiding approximations and demonstrating a straightforward path to scaling context windows.
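A minimal sketch of how these pieces could be wired together with Hugging Face Transformers and DeepSpeed is shown below. It is illustrative only: the model name is an example, the ZeRO-3 offload config is a bare-bones assumption, and apply_seq_parallel_patch is a hypothetical stand-in for the repository's own sequence-parallel utilities.

```python
import torch
from transformers import AutoModelForCausalLM

# DeepSpeed ZeRO-3 with CPU offload for parameters and optimizer states.
# In practice this dict (or a JSON file) is handed to the training loop,
# e.g. via TrainingArguments(deepspeed=...) or deepspeed.initialize(...).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

# Flash Attention kernels via the Transformers integration.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # example model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Activation checkpointing: recompute activations in the backward pass
# instead of storing them all.
model.gradient_checkpointing_enable()

# Sequence parallelism (Ring Attention / Dist Flash Attention / DeepSpeed
# Ulysses) is applied by the repository's own utilities; this call is a
# hypothetical placeholder for that step.
# apply_seq_parallel_patch(model, method="ring_attention")
```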
Quick Start & Requirements
Set up a conda environment with Python 3.10 and a PyTorch nightly build with CUDA 11.8, then install the dependencies:
pip install -r requirements.txt
Flash Attention should be installed with the --no-build-isolation --no-cache-dir pip flags.
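As a quick sanity check (illustrative only, not part of the repository), the snippet below confirms that a PyTorch nightly built against CUDA 11.8 and a working Flash Attention install are present:

```python
import torch

# Expect a nightly (dev) build of PyTorch compiled against CUDA 11.8.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; install it with --no-build-isolation --no-cache-dir")
```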
Highlighted Details
Full fine-tuning with full attention over the full sequence length, with no approximations: 700K-token context with Llama-2-7B on 8 A100s and 1M-token context with Llama-2-13B on 16 A100s.
Maintenance & Community
The project is actively updated; recent additions include DeepSpeed Ulysses support and distractors for the needle-in-a-haystack evaluation. Community contributions and collaborations are welcome via issues and pull requests.
Licensing & Compatibility
The repository does not explicitly state a license in the README. However, it acknowledges and builds upon several open-source projects, including Ring-Attention and Flash-Attention. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.
Limitations & Caveats
The project notes that some "red bricks" remain in needle-in-a-haystack evaluations, suggesting potential improvements through instruction tuning or more extensive long-context training. PyTorch nightly builds are required due to memory issues with stable versions. There is no clear timeline for planned features like instruction tuning or Mistral model support.