Long-Context-Data-Engineering by FranxYao

Research paper implementation for long-context data engineering

created 1 year ago
467 stars

Top 66.0% on sourcepulse

View on GitHub
Project Summary

This repository provides the data engineering implementation for scaling language models to 128K context, as detailed in the paper "Data Engineering for Scaling Language Models to 128K Context." It enables researchers and practitioners to replicate the long-context retrieval performance of models like GPT-4 by offering pre-trained checkpoints and data processing tools.

How It Works

The project adapts existing LLaMA-2 models (7B and 13B) through continued pre-training on a custom-processed dataset designed for long contexts. It uses a modified Rotary Positional Embedding (RoPE) scaling technique and a custom tensor-parallelism implementation for efficient inference on long sequences. The data processing upsamples long sequences from the SlimPajama dataset and packs them into 131,072-token (128K) chunks, as sketched below.
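To make the chunking step concrete, here is a minimal sketch of packing tokenized documents into 131,072-token training chunks. The helper names, the min_len threshold, and the naive repeat-based upsampling are illustrative assumptions, not the repo's actual pipeline; the paper's per-source length upsampling keeps each source's share of the mixture fixed while oversampling its long documents.

    CHUNK_LEN = 131072  # 128K-token training chunks, as described above

    def upsample_long_docs(tokenized_docs, min_len=4096, extra_copies=1):
        """Crudely oversample long documents by yielding extra copies
        (a stand-in for the paper's per-source length upsampling)."""
        for doc in tokenized_docs:
            yield doc
            if len(doc) >= min_len:
                for _ in range(extra_copies):
                    yield doc

    def pack_into_chunks(token_streams, chunk_len=CHUNK_LEN):
        """Concatenate tokenized documents and slice into fixed-length chunks."""
        buffer = []
        for tokens in token_streams:
            buffer.extend(tokens)
            while len(buffer) >= chunk_len:
                yield buffer[:chunk_len]
                buffer = buffer[chunk_len:]
        # a trailing partial chunk is dropped here; padding is another option

    # Usage: chunks = list(pack_into_chunks(upsample_long_docs(docs)))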

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt (PyTorch assumed pre-installed).
  • Download checkpoints: use the provided Python snippets to download yaofu/llama-2-7b-80k and yaofu/llama-2-13b-64k (a loading sketch follows this list).
  • Hardware: requires significant GPU resources (e.g., 8x RTX 4090 for 80K-context inference, 4x 80GB A100 for 128K). Evaluation takes ~24 hours on 4x 80GB A100. Data generation needs ~200 CPU cores and ~50GB of RAM.
  • Data: the slimpajama-per-source-length-upsample dataset (1.8TB) is required for data generation.
  • Links: HF Repo, Paper
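For reference, a minimal loading sketch, assuming the checkpoints work with the standard transformers from_pretrained API; the bf16 dtype and flash-attention flags are optional choices made here, not requirements stated in the README:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Downloads and caches the 80K-context 7B checkpoint from the Hugging
    # Face Hub. bf16 plus flash-attention keeps memory manageable at long
    # sequence lengths; drop attn_implementation if flash-attn is not installed.
    name = "yaofu/llama-2-7b-80k"  # or "yaofu/llama-2-13b-64k"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )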

Highlighted Details

  • Achieves GPT-4 level long-context retrieval performance.
  • Offers continued pre-trained LLaMA-2 7B (80K context) and 13B (64K context) checkpoints.
  • Implements custom tensor parallelism for faster inference than Hugging Face's device_map or vLLM.
  • Includes evaluation scripts for Needle-in-a-Haystack and BookQA datasets (a minimal illustration of the needle test follows this list).
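As background on the first of those benchmarks, here is a minimal illustration of how a needle-in-a-haystack probe is constructed (prompt construction only; the repo's actual scripts, needle text, and scoring will differ):

    def build_niah_prompt(haystack_sentences, needle, depth_frac, question):
        """Insert the needle at a relative depth in the haystack, then append
        the retrieval question. depth_frac=0.0 is the start, 1.0 the end."""
        pos = int(len(haystack_sentences) * depth_frac)
        sentences = haystack_sentences[:pos] + [needle] + haystack_sentences[pos:]
        return " ".join(sentences) + f"\n\nQuestion: {question}\nAnswer:"

    prompt = build_niah_prompt(
        ["The grass is green."] * 5000,   # filler context
        "The magic number is 42.",        # the needle
        depth_frac=0.5,
        question="What is the magic number?",
    )
    # Score by checking whether the model's generation contains "42".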

Maintenance & Community

The project is associated with Yao Fu and Hannaneh Hajishirzi, co-authors of the referenced paper. The README does not explicitly document further community-engagement channels.

Licensing & Compatibility

The repository itself does not appear to specify a license. The underlying LLaMA-2 models are subject to Meta's Llama 2 Community License, and the SlimPajama data carries its own licensing terms; suitability for commercial use therefore depends on both.

Limitations & Caveats

The custom tensor-parallelism implementation may fail silently on insufficient GPU memory instead of raising an error. Tokenizing very long documents can be a performance bottleneck. The project also notes that the longbook_qa_eng dataset version it uses differs from the original InfiniteBench upload.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days

Explore Similar Projects

TinyLlama by jzhang38: a tiny pretraining project for a 1.1B Llama model (9k stars; created 1 year ago; updated 1 year ago). Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.