EasyContext by jzhang38

Recipes for language model context length extrapolation to 1M tokens

created 1 year ago
739 stars

Top 47.9% on sourcepulse

Project Summary

This repository provides memory-optimization and training recipes for extending language models' context length to 1 million tokens on modest hardware. It targets researchers and practitioners who want to demystify and implement long-context capabilities without proprietary infrastructure. The project demonstrates that a 700K-token context with Llama-2-7B on 8 A100s, and a 1M-token context with Llama-2-13B on 16 A100s, is achievable by combining existing techniques.

How It Works

The project combines several established techniques to enable efficient long-context training: sequence parallelism, DeepSpeed ZeRO-3 offload, Flash Attention with its fused kernels, and activation checkpointing. It supports multiple sequence-parallel methods, including Ring Attention, Dist Flash Attention, and DeepSpeed Ulysses. This allows full fine-tuning with full attention over the full sequence length, avoiding approximations and demonstrating a straightforward path to scaling context windows.
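
To make the recipe concrete, here is a minimal sketch of the ingredients named above. It is not the repository's actual code: the shard_sequence helper, the model id, and the DeepSpeed config values are illustrative. Each GPU holds only its seq_len / world_size slice of the batch, the model uses Flash Attention kernels and activation checkpointing, and ZeRO-3 offload keeps parameter and optimizer shards in host memory. In the real recipes the attention layers are additionally patched (Ring Attention, Dist Flash Attention, or DeepSpeed Ulysses) so attention still spans the full sequence via inter-GPU communication.

    import torch
    from transformers import AutoModelForCausalLM

    def shard_sequence(input_ids, labels, rank, world_size):
        """Give each GPU a contiguous seq_len / world_size slice of the batch,
        so no single device materializes activations for the full sequence."""
        seq_len = input_ids.shape[1]
        assert seq_len % world_size == 0, "sequence length must divide evenly across GPUs"
        chunk = seq_len // world_size
        return (input_ids[:, rank * chunk:(rank + 1) * chunk],
                labels[:, rank * chunk:(rank + 1) * chunk])

    # Flash Attention fused kernels + activation checkpointing on the model side
    # (requires access to the gated Llama-2 weights and an installed flash-attn).
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    model.gradient_checkpointing_enable()

    # DeepSpeed ZeRO-3 with CPU offload shards parameters, gradients, and
    # optimizer state across ranks and spills them to host memory.
    ds_config = {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu"},
            "offload_param": {"device": "cpu"},
        },
        "bf16": {"enabled": True},
        "train_micro_batch_size_per_gpu": 1,
    }

The ds_config dictionary would then be handed to deepspeed.initialize or to the Hugging Face Trainer through its DeepSpeed integration.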

Quick Start & Requirements

  • Install: pip install -r requirements.txt after setting up a conda environment with Python 3.10 and a PyTorch nightly build with CUDA 11.8.
  • Prerequisites: Python 3.10, PyTorch 2.4.0 (nightly), CUDA 11.8, Ninja, packaging, and flash-attn (installed with the --no-build-isolation --no-cache-dir pip flags); a quick environment sanity check is sketched after this list.
  • Resources: training a 700K-token context on Llama-2-7B requires 8 A100 GPUs; evaluating a 1M-token context takes approximately 6 hours on 8 A100s.
  • Links: Hugging Face, LongVA
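
Before launching a run, a short sanity check along these lines (illustrative, not part of the repository) can confirm that the nightly PyTorch build, CUDA, flash-attn, and DeepSpeed are all visible:

    import importlib.util
    import torch

    print("torch:", torch.__version__)          # expect a 2.4.0 nightly/dev build
    print("CUDA available:", torch.cuda.is_available())
    print("CUDA version:", torch.version.cuda)  # expect 11.8
    print("GPUs:", torch.cuda.device_count())   # 8 (or 16 for the 13B recipe)

    for pkg in ("flash_attn", "deepspeed", "transformers"):
        spec = importlib.util.find_spec(pkg)
        print(f"{pkg}: {'found' if spec else 'MISSING'}")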

Highlighted Details

  • Achieves 700K context with Llama-2-7B on 8 A100s and 1M context with Llama-2-13B on 16 A100s.
  • Demonstrates generalization to nearly 1M context from training at 512K sequence length.
  • Includes evaluation scripts for needle-in-a-haystack and perplexity benchmarks (a conceptual sketch of the needle test follows this list).
  • Training script is concise (<200 lines) and integrates with Hugging Face Transformers.
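
For context, the needle-in-a-haystack test hides a short "needle" fact at a controlled depth inside long filler text and asks the model to retrieve it, sweeping context length and depth to map where retrieval fails. The sketch below is a conceptual illustration, not the repository's evaluation script; the filler, needle, and question strings are placeholders, and the tokenizer is assumed to be a Hugging Face tokenizer loaded separately.

    def build_needle_prompt(filler, needle, question, context_tokens, depth, tokenizer):
        """Embed `needle` at a relative `depth` (0.0 = start, 1.0 = end) inside
        roughly `context_tokens` tokens of filler, then append the question."""
        filler_ids = tokenizer(filler, add_special_tokens=False)["input_ids"]
        # Repeat the filler until the haystack is long enough, then truncate.
        haystack = (filler_ids * (context_tokens // max(len(filler_ids), 1) + 1))[:context_tokens]
        insert_at = int(depth * len(haystack))
        needle_ids = tokenizer(needle, add_special_tokens=False)["input_ids"]
        ids = haystack[:insert_at] + needle_ids + haystack[insert_at:]
        return tokenizer.decode(ids) + "\n\n" + question

    # Example usage with placeholder strings:
    # prompt = build_needle_prompt(
    #     filler="The grass is green. The sky is blue. ",
    #     needle="The secret passphrase is 'easycontext'. ",
    #     question="What is the secret passphrase?",
    #     context_tokens=1_000_000, depth=0.5, tokenizer=tokenizer)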

Maintenance & Community

The project has seen regular updates, with recent additions including DeepSpeed Ulysses support and distractors for evaluation. Community contributions and collaborations are welcomed via issues and pull requests.

Licensing & Compatibility

The repository does not explicitly state a license in the README. However, it acknowledges and builds upon several open-source projects, including Ring-Attention and Flash-Attention. Compatibility for commercial use or closed-source linking would require clarification of the licensing terms.

Limitations & Caveats

The project notes that some "red bricks" (failed retrievals) remain in the needle-in-a-haystack results, suggesting room for improvement through instruction tuning or more extensive long-context training. PyTorch nightly builds are required due to memory issues with stable releases. There is no clear timeline for planned features such as instruction tuning or Mistral model support.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 21 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Georgios Konstantopoulos (CTO, General Partner at Paradigm).

LongLoRA by dvlab-research

Top 0.1% on sourcepulse
3k stars
LongLoRA: Efficient fine-tuning for long-context LLMs
created 1 year ago
updated 11 months ago