Research paper on scaling data-constrained language models
This repository provides the code, models, and datasets for the paper "Scaling Data-Constrained Language Models," which investigates optimizing language model training under data constraints. It offers tools and pre-trained models for researchers and practitioners aiming to improve efficiency and performance by managing data repetition and compute budgets.
How It Works
The project explores data-constrained scaling laws for language models, focusing on the diminishing returns of data repetition and excess parameters. It introduces a compute-optimal scaling law that accounts for these factors and provides methods for mitigating data scarcity through techniques like code augmentation, perplexity filtering, and deduplication. The core approach involves extensive experimentation with varying data repetition levels and compute budgets, up to 900 billion tokens and 9 billion parameters.
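As a rough illustration of that idea, the sketch below computes "effective" data and parameter counts in which repeated epochs and excess parameters contribute diminishing value, loosely following the parametric form used in the paper. The function names are ours, and the constants are illustrative placeholders rather than the paper's fitted values.

```python
import math

def effective_tokens(total_tokens: float, unique_tokens: float, r_star_d: float) -> float:
    """Effective data D': epochs beyond the first pass add exponentially
    diminishing value, with decay controlled by r_star_d (placeholder constant)."""
    repetitions = total_tokens / unique_tokens - 1  # epochs beyond the first
    return unique_tokens + unique_tokens * r_star_d * (1 - math.exp(-repetitions / r_star_d))

def effective_params(params: float, base_params: float, r_star_n: float) -> float:
    """Effective parameters N': parameters in excess of the data-matched size
    base_params likewise add diminishing value."""
    excess = params / base_params - 1
    return base_params + base_params * r_star_n * (1 - math.exp(-excess / r_star_n))

def predicted_loss(params, total_tokens, unique_tokens, base_params,
                   E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28,
                   r_star_n=5.0, r_star_d=15.0):
    """Chinchilla-style loss with effective counts: L = E + A/N'^alpha + B/D'^beta.
    All constants here are placeholders; in the paper they are fit to experiments."""
    n_eff = effective_params(params, base_params, r_star_n)
    d_eff = effective_tokens(total_tokens, unique_tokens, r_star_d)
    return E + A / n_eff ** alpha + B / d_eff ** beta

# Example: 4 epochs over 100B unique tokens with an 8B-parameter model.
print(predicted_loss(params=8e9, total_tokens=400e9, unique_tokens=100e9, base_params=8e9))
```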
Quick Start & Requirements
- Data preprocessing: clone the Megatron-DeepSpeed repository and use its preprocess_data_many_cores.py script. Tokenization uses the gpt2 tokenizer.
- Evaluation: install lm-evaluation-harness (pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git). Specific evaluation tasks may require cloning different branches of the harness.
- Training: runs on AMD GPUs via the TurkuNLP fork of Megatron-DeepSpeed (https://github.com/TurkuNLP/Megatron-DeepSpeed) or on NVIDIA GPUs via the original library (https://github.com/bigscience-workshop/Megatron-DeepSpeed). Setup instructions are detailed in training/megdssetup.md.
- Dependencies: datasets, numpy, transformers. Specific model training may require CUDA or ROCm. A quick tokenizer check is sketched after this list.
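The snippet below is a minimal sanity check of the listed dependencies: it loads the gpt2 tokenizer and tokenizes a toy corpus with the datasets library. The file name is a hypothetical placeholder, and the repository's actual preprocessing goes through Megatron-DeepSpeed's preprocess_data_many_cores.py rather than this code.

```python
# Sanity check for the gpt2 tokenizer and the datasets/transformers dependencies.
# "my_corpus.txt" is a hypothetical placeholder; real preprocessing should use
# preprocess_data_many_cores.py from Megatron-DeepSpeed as described above.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load a plain-text file as a Dataset (one document per line).
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"}, split="train")

def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
print(tokenized[0]["input_ids"][:10])  # first token ids of the first document
```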
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
- Reproducing training requires setting up Megatron-DeepSpeed and large-scale training infrastructure.
- Evaluation depends on specific branches of lm-evaluation-harness.