Long-Context-Data-Engineering by FranxYao

Research paper implementation for long-context data engineering

created 1 year ago
467 stars

Top 66.0% on sourcepulse

View on GitHub
Project Summary

This repository provides the data engineering implementation for scaling language models to 128K context, as detailed in the paper "Data Engineering for Scaling Language Models to 128K Context." It enables researchers and practitioners to replicate the long-context retrieval performance of models like GPT-4 by offering pre-trained checkpoints and data processing tools.

How It Works

The project adapts existing LLaMA-2 models (7B and 13B) through continued pre-training on a custom-processed dataset designed for long contexts. It uses a modified Rotary Positional Embedding (RoPE) scaling technique and a custom tensor-parallelism implementation for efficient inference on long sequences. The data processing upsamples long sequences from the SlimPajama dataset and packs them into 131,072-token (128K) chunks, as sketched below.
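To make the chunking step concrete, here is a minimal sketch of packing tokenized documents into 131,072-token training chunks. The helper names, the min_len threshold, and the naive repeat-based upsampling are illustrative assumptions, not the repo's actual pipeline; the paper's per-source length upsampling keeps each source's share of the mixture fixed while oversampling its long documents.

    CHUNK_LEN = 131072  # 128K-token training chunks, as described above

    def upsample_long_docs(tokenized_docs, min_len=4096, extra_copies=1):
        """Crudely oversample long documents by yielding extra copies
        (a stand-in for the paper's per-source length upsampling)."""
        for doc in tokenized_docs:
            yield doc
            if len(doc) >= min_len:
                for _ in range(extra_copies):
                    yield doc

    def pack_into_chunks(token_streams, chunk_len=CHUNK_LEN):
        """Concatenate tokenized documents and slice into fixed-length chunks."""
        buffer = []
        for tokens in token_streams:
            buffer.extend(tokens)
            while len(buffer) >= chunk_len:
                yield buffer[:chunk_len]
                buffer = buffer[chunk_len:]
        # a trailing partial chunk is dropped here; padding is another option

    # Usage: chunks = list(pack_into_chunks(upsample_long_docs(docs)))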

Quick Start & Requirements

  • Install dependencies: pip install -r requirements.txt (PyTorch assumed pre-installed).
  • Download checkpoints: use the provided Python snippets to download yaofu/llama-2-7b-80k and yaofu/llama-2-13b-64k (a loading sketch follows this list).
  • Hardware: requires significant GPU resources (e.g., 8x RTX 4090 for 80K-context inference, 4x 80GB A100 for 128K). Evaluation takes ~24 hours on 4x 80GB A100. Data generation needs ~200 CPU cores and ~50GB of RAM.
  • Data: the slimpajama-per-source-length-upsample dataset (1.8TB) is required for data generation.
  • Links: HF Repo, Paper
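For reference, a minimal loading sketch, assuming the checkpoints work with the standard transformers from_pretrained API; the bf16 dtype and flash-attention flags are optional choices made here, not requirements stated in the README:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Downloads and caches the 80K-context 7B checkpoint from the Hugging
    # Face Hub. bf16 plus flash-attention keeps memory manageable at long
    # sequence lengths; drop attn_implementation if flash-attn is not installed.
    name = "yaofu/llama-2-7b-80k"  # or "yaofu/llama-2-13b-64k"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )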

Highlighted Details

  • Achieves GPT-4 level long-context retrieval performance.
  • Offers continued pre-trained LLaMA-2 7B (80K context) and 13B (64K context) checkpoints.
  • Implements custom tensor parallelism for faster inference than Hugging Face's device_map or vLLM.
  • Includes evaluation scripts for Needle-in-a-Haystack and BookQA datasets (a minimal illustration of the needle test follows this list).
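As background on the first of those benchmarks, here is a minimal illustration of how a needle-in-a-haystack probe is constructed (prompt construction only; the repo's actual scripts, needle text, and scoring will differ):

    def build_niah_prompt(haystack_sentences, needle, depth_frac, question):
        """Insert the needle at a relative depth in the haystack, then append
        the retrieval question. depth_frac=0.0 is the start, 1.0 the end."""
        pos = int(len(haystack_sentences) * depth_frac)
        sentences = haystack_sentences[:pos] + [needle] + haystack_sentences[pos:]
        return " ".join(sentences) + f"\n\nQuestion: {question}\nAnswer:"

    prompt = build_niah_prompt(
        ["The grass is green."] * 5000,   # filler context
        "The magic number is 42.",        # the needle
        depth_frac=0.5,
        question="What is the magic number?",
    )
    # Score by checking whether the model's generation contains "42".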

Maintenance & Community

The project is associated with Yao Fu and Hannaneh Hajishirzi, co-authors of the referenced paper. The README does not explicitly document further community-engagement channels.

Licensing & Compatibility

The repository itself does not appear to specify a license. The underlying LLaMA-2 models are subject to Meta's Llama 2 Community License, and the SlimPajama data carries its own licensing terms; suitability for commercial use therefore depends on both.

Limitations & Caveats

The custom tensor-parallelism implementation may fail silently on insufficient GPU memory instead of raising an error. Tokenizing very long documents can be a performance bottleneck. The project also notes that the longbook_qa_eng dataset version it uses differs from the original InfiniteBench upload.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 9 stars in the last 90 days

Explore Similar Projects

TinyLlama by jzhang38: a tiny pretraining project for a 1.1B Llama model (9k stars; created 1 year ago; updated 1 year ago). Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), George Hotz (author of tinygrad; founder of the tiny corp, comma.ai), and 10 more.