datablations by huggingface

Research paper on scaling data-constrained language models

created 2 years ago
338 stars

Top 82.6% on sourcepulse

View on GitHub
Project Summary

This repository provides the code, models, and datasets for the paper "Scaling Data-Constrained Language Models," which investigates optimizing language model training under data constraints. It offers tools and pre-trained models for researchers and practitioners aiming to improve efficiency and performance by managing data repetition and compute budgets.

How It Works

The project studies data-constrained scaling laws for language models: as a fixed dataset is repeated over more epochs, the value of repeated tokens and of excess parameters decays, so returns diminish. It proposes a data-constrained scaling law that accounts for this decay and evaluates ways to stretch a limited data budget, such as augmenting with code, perplexity filtering, and deduplication. The conclusions rest on a large experimental sweep over repetition levels and compute budgets, reaching up to 900 billion training tokens and 9 billion parameters.
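At the core is a parametric fit in which repeated tokens contribute exponentially less than fresh ones (the paper applies an analogous decay to excess parameters, omitted here for brevity). The sketch below is illustrative only: the constants are placeholders rather than the paper's fitted values, and the function names are not from the repository.

```python
import math

def effective_data(unique_tokens, repetitions, r_star=15.0):
    """Repetition-adjusted token count D': each extra epoch adds value that
    decays exponentially with the repetition count (r_star is a fitted decay
    constant in the paper; 15.0 is a placeholder here)."""
    return unique_tokens + unique_tokens * r_star * (1 - math.exp(-repetitions / r_star))

def expected_loss(params, unique_tokens, repetitions,
                  A=400.0, B=1400.0, E=1.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric loss L = E + A/N^alpha + B/D'^beta with the
    repetition-adjusted D'. All constants are illustrative placeholders."""
    d_prime = effective_data(unique_tokens, repetitions)
    return E + A / params ** alpha + B / d_prime ** beta

# Repeating a 100B-token corpus four extra times still lowers the loss,
# but by much less than 500B fresh tokens would.
print(expected_loss(params=9e9, unique_tokens=100e9, repetitions=4))
```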

Quick Start & Requirements

  • Data Preprocessing: Requires cloning the Megatron-DeepSpeed repository and using its preprocess_data_many_cores.py script; tokenization uses the GPT-2 (gpt2) tokenizer.
  • Evaluation: Requires installing lm-evaluation-harness (pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git). Specific evaluation tasks may require cloning different branches of the harness.
  • Training: Supports AMD GPUs via a fork of Megatron-DeepSpeed (https://github.com/TurkuNLP/Megatron-DeepSpeed) or NVIDIA GPUs via the original library (https://github.com/bigscience-workshop/Megatron-DeepSpeed). Setup instructions are detailed in training/megdssetup.md.
  • Dependencies: Python with datasets, numpy, and transformers. Model training additionally requires CUDA (NVIDIA) or ROCm (AMD).
  • Resources: Preprocessed datasets and models are available on Hugging Face Hub. Training large models requires significant compute resources.
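Released datasets and checkpoints can be pulled from the Hugging Face Hub with the standard datasets/transformers APIs. A minimal sketch, assuming a transformers-compatible checkpoint; the repository ids below are placeholders, so substitute the exact names listed in the README or the project's Hub organization:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ids -- replace with the actual dataset/model names from the README.
DATASET_REPO = "datablations/<preprocessed-dataset>"
MODEL_REPO = "datablations/<model-checkpoint>"

# Stream the dataset to avoid downloading hundreds of gigabytes up front.
dataset = load_dataset(DATASET_REPO, split="train", streaming=True)
print(next(iter(dataset)))

# The project tokenizes with GPT-2; checkpoints converted to the
# transformers format load like any other causal LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(MODEL_REPO)
```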

Highlighted Details

  • Provides 400 pre-trained models and datasets covering various configurations of parameters, tokens, and data augmentation strategies.
  • Includes the parametric data-constrained scaling law and Python code for calculating expected loss and the compute-optimal model/data allocation (see the allocation sketch after this list).
  • Offers detailed scripts and instructions for data preprocessing, training, and downstream evaluation (accuracy, generative, bAbI).
  • Contains extensive plotting and table generation scripts to reproduce figures and tables from the paper.
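One way to use such a parametric law, sketched below, is to grid-search model size under a fixed compute budget and a fixed pool of unique tokens, using the common C ≈ 6·N·D FLOPs approximation. The loss is restated from the earlier sketch to keep this self-contained; constants and names remain illustrative, not the repository's actual code.

```python
import math

def expected_loss(params, unique_tokens, repetitions,
                  A=400.0, B=1400.0, E=1.7, alpha=0.34, beta=0.28, r_star=15.0):
    # Same illustrative data-constrained loss as in the earlier sketch:
    # repeated tokens contribute exponentially less than fresh ones.
    d_prime = unique_tokens + unique_tokens * r_star * (1 - math.exp(-repetitions / r_star))
    return E + A / params ** alpha + B / d_prime ** beta

def best_allocation(compute_flops, unique_tokens):
    """Pick model size N on a log grid; total tokens D follow from the
    common C ~= 6*N*D approximation, repetitions from D / unique_tokens."""
    best = None
    for step in range(70, 121):                   # N from 1e7 to ~1.3e12
        n = 10 ** (step / 10)
        d = compute_flops / (6 * n)               # tokens affordable at this N
        seen_unique = min(d, unique_tokens)       # fresh tokens actually used
        reps = max(d / unique_tokens - 1.0, 0.0)  # epochs beyond the first
        loss = expected_loss(n, seen_unique, reps)
        if best is None or loss < best[0]:
            best = (loss, n, d)
    return best

loss, n, d = best_allocation(compute_flops=1e22, unique_tokens=100e9)
print(f"loss={loss:.3f}  params={n:.3e}  tokens={d:.3e}")
```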

Maintenance & Community

  • The primary author is Niklas Muennighoff.
  • The project is associated with Hugging Face; additional author affiliations are listed in the paper.
  • No specific community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

  • All code and models are licensed under Apache 2.0.
  • Filtered datasets inherit the license of their source datasets.
  • Apache 2.0 is permissive for commercial use and closed-source linking.

Limitations & Caveats

  • The README details complex data preprocessing and training setups that require familiarity with Megatron-DeepSpeed and large-scale training infrastructure.
  • Some model checkpoints are split into multiple files due to size limits and must be concatenated before use (see the sketch after this list).
  • Specific evaluation tasks require cloning different, potentially incompatible, branches of the lm-evaluation-harness.
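Reassembling a split checkpoint is plain byte concatenation of the parts in order. A minimal sketch, assuming the parts share a common prefix and sort lexicographically; the naming pattern is hypothetical, so adjust it to the actual files shipped with the model:

```python
import glob
import shutil

def join_checkpoint(part_prefix, output_path):
    """Concatenate split checkpoint parts (hypothetical naming, e.g.
    pytorch_model.bin.part-00, .part-01, ...) into a single file."""
    parts = sorted(glob.glob(part_prefix + "*"))
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

join_checkpoint("pytorch_model.bin.part-", "pytorch_model.bin")
```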
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Alex Cheema (cofounder of EXO Labs), and 1 more.

recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

Top 0.1% on sourcepulse · 806 stars · created 5 months ago · updated 2 weeks ago