Research paper implementation for long-context data engineering
This repository provides the data engineering implementation for scaling language models to 128K context, as detailed in the paper "Data Engineering for Scaling Language Models to 128K Context." It enables researchers and practitioners to replicate the long-context retrieval performance of models like GPT-4 by offering pre-trained checkpoints and data processing tools.
How It Works
The project adapts existing LLaMA-2 models (7B and 13B) through continued pre-training on a custom-processed dataset designed for long contexts. It uses a modified Rotary Positional Embedding (RoPE) scaling technique and a custom tensor parallelism implementation for efficient inference on long sequences. The data processing upsamples long sequences from the SlimPajama dataset per source and packs them into chunks of 131,072 tokens (128K).
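To illustrate the packing step, here is a minimal sketch, not the repository's pipeline: it assumes documents are already tokenized, ignores the per-source upsampling weights, and uses hypothetical names (`pack_into_chunks`, `CHUNK_LEN`).

```python
# Minimal sketch of concatenate-and-split packing into fixed 131,072-token
# chunks. The per-source upsampling ratios from the paper are omitted here.
from typing import Iterable, List

CHUNK_LEN = 131072  # 128K context length targeted by the paper


def pack_into_chunks(token_streams: Iterable[List[int]],
                     chunk_len: int = CHUNK_LEN) -> List[List[int]]:
    """Concatenate tokenized documents and emit fixed-length chunks."""
    buffer: List[int] = []
    chunks: List[List[int]] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= chunk_len:
            chunks.append(buffer[:chunk_len])
            buffer = buffer[chunk_len:]
    # The trailing partial chunk is simply dropped in this sketch.
    return chunks
```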
Quick Start & Requirements
- Install dependencies with `pip install -r requirements.txt` (PyTorch assumed pre-installed).
- Pre-trained checkpoints are released on Hugging Face as `yaofu/llama-2-7b-80k` and `yaofu/llama-2-13b-64k` (a loading sketch follows this list).
- The `slimpajama-per-source-length-upsample` dataset (1.8T) is required for data generation.
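As a quick sanity check of the released checkpoints, they can be loaded with the standard Hugging Face transformers API. This is a hedged sketch rather than the repository's own long-context inference path (which uses the custom tensor parallelism); the `torch_dtype` and `device_map="auto"` settings are assumptions, and a genuine 80K-token prompt needs far more GPU memory than this toy prompt.

```python
# Sanity-check sketch with standard transformers loading; the repository's
# long-context inference uses its custom tensor parallelism instead of
# device_map. Model ID comes from the README; other settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yaofu/llama-2-7b-80k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 to reduce memory use
    device_map="auto",           # simple sharding, not the repo's TP code
)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```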
Highlighted Details
- Long-sequence inference relies on the repository's custom tensor parallelism implementation rather than Hugging Face `device_map` or vLLM.
Maintenance & Community
The project is associated with Yao Fu and Hannaneh Hajishirzi, authors of the referenced paper. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository itself does not specify a license. The underlying models (LLaMA-2) are subject to Meta's Llama 2 Community License, and the SlimPajama data is subject to its own licensing terms. Suitability for commercial use therefore depends on the LLaMA-2 and SlimPajama licenses.
Limitations & Caveats
The custom tensor parallelism implementation may fail silently on insufficient GPU memory instead of raising an error. Tokenization of very long documents can be a performance bottleneck. The project also notes a discrepancy between the `longbook_qa_eng` dataset version it uses and the original InfiniteBench upload.
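One possible mitigation for the tokenization bottleneck, not something the repository prescribes, is to batch documents through the fast Rust-backed tokenizer, which parallelizes work internally:

```python
# Hedged sketch: batched calls to the fast (Rust-backed) tokenizer as one way
# to speed up tokenization of many long documents. Not part of the repository.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yaofu/llama-2-7b-80k", use_fast=True)

documents = ["first long document ...", "second long document ..."]
# A single batched call lets the tokenizer parallelize across documents.
encodings = tokenizer(documents, add_special_tokens=False)
token_streams = encodings["input_ids"]  # one list of token ids per document
print([len(ids) for ids in token_streams])
```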