datablations by huggingface

Research paper on scaling data-constrained language models

created 2 years ago
338 stars

Top 82.6% on sourcepulse

View on GitHub
Project Summary

This repository provides the code, models, and datasets for the paper "Scaling Data-Constrained Language Models," which investigates optimizing language model training under data constraints. It offers tools and pre-trained models for researchers and practitioners aiming to improve efficiency and performance by managing data repetition and compute budgets.

How It Works

The project studies data-constrained scaling laws for language models: as a fixed dataset is repeated over more epochs, the value of repeated tokens and of excess parameters decays, so returns diminish. It proposes a data-constrained scaling law that accounts for this decay and evaluates ways to stretch a limited data budget, such as augmenting with code, perplexity filtering, and deduplication. The conclusions rest on a large experimental sweep over repetition levels and compute budgets, reaching up to 900 billion training tokens and 9 billion parameters.
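At the core is a parametric fit in which repeated tokens contribute exponentially less than fresh ones (the paper applies an analogous decay to excess parameters, omitted here for brevity). The sketch below is illustrative only: the constants are placeholders rather than the paper's fitted values, and the function names are not from the repository.

```python
import math

def effective_data(unique_tokens, repetitions, r_star=15.0):
    """Repetition-adjusted token count D': each extra epoch adds value that
    decays exponentially with the repetition count (r_star is a fitted decay
    constant in the paper; 15.0 is a placeholder here)."""
    return unique_tokens + unique_tokens * r_star * (1 - math.exp(-repetitions / r_star))

def expected_loss(params, unique_tokens, repetitions,
                  A=400.0, B=1400.0, E=1.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric loss L = E + A/N^alpha + B/D'^beta with the
    repetition-adjusted D'. All constants are illustrative placeholders."""
    d_prime = effective_data(unique_tokens, repetitions)
    return E + A / params ** alpha + B / d_prime ** beta

# Repeating a 100B-token corpus four extra times still lowers the loss,
# but by much less than 500B fresh tokens would.
print(expected_loss(params=9e9, unique_tokens=100e9, repetitions=4))
```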

Quick Start & Requirements

  • Data Preprocessing: Requires cloning the Megatron-DeepSpeed repository and using its preprocess_data_many_cores.py script; tokenization uses the GPT-2 (gpt2) tokenizer.
  • Evaluation: Requires installing lm-evaluation-harness (pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git). Specific evaluation tasks may require cloning different branches of the harness.
  • Training: Supports AMD GPUs via a fork of Megatron-DeepSpeed (https://github.com/TurkuNLP/Megatron-DeepSpeed) or NVIDIA GPUs via the original library (https://github.com/bigscience-workshop/Megatron-DeepSpeed). Setup instructions are detailed in training/megdssetup.md.
  • Dependencies: Python with datasets, numpy, and transformers. Model training additionally requires CUDA (NVIDIA) or ROCm (AMD).
  • Resources: Preprocessed datasets and models are available on Hugging Face Hub. Training large models requires significant compute resources.
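Released datasets and checkpoints can be pulled from the Hugging Face Hub with the standard datasets/transformers APIs. A minimal sketch, assuming a transformers-compatible checkpoint; the repository ids below are placeholders, so substitute the exact names listed in the README or the project's Hub organization:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ids -- replace with the actual dataset/model names from the README.
DATASET_REPO = "datablations/<preprocessed-dataset>"
MODEL_REPO = "datablations/<model-checkpoint>"

# Stream the dataset to avoid downloading hundreds of gigabytes up front.
dataset = load_dataset(DATASET_REPO, split="train", streaming=True)
print(next(iter(dataset)))

# The project tokenizes with GPT-2; checkpoints converted to the
# transformers format load like any other causal LM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(MODEL_REPO)
```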

Highlighted Details

  • Provides 400 pre-trained models and datasets covering various configurations of parameters, tokens, and data augmentation strategies.
  • Includes the parametric data-constrained scaling law and Python code for calculating expected loss and the compute-optimal model/data allocation (see the allocation sketch after this list).
  • Offers detailed scripts and instructions for data preprocessing, training, and downstream evaluation (accuracy, generative, bAbI).
  • Contains extensive plotting and table generation scripts to reproduce figures and tables from the paper.
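One way to use such a parametric law, sketched below, is to grid-search model size under a fixed compute budget and a fixed pool of unique tokens, using the common C ≈ 6·N·D FLOPs approximation. The loss is restated from the earlier sketch to keep this self-contained; constants and names remain illustrative, not the repository's actual code.

```python
import math

def expected_loss(params, unique_tokens, repetitions,
                  A=400.0, B=1400.0, E=1.7, alpha=0.34, beta=0.28, r_star=15.0):
    # Same illustrative data-constrained loss as in the earlier sketch:
    # repeated tokens contribute exponentially less than fresh ones.
    d_prime = unique_tokens + unique_tokens * r_star * (1 - math.exp(-repetitions / r_star))
    return E + A / params ** alpha + B / d_prime ** beta

def best_allocation(compute_flops, unique_tokens):
    """Pick model size N on a log grid; total tokens D follow from the
    common C ~= 6*N*D approximation, repetitions from D / unique_tokens."""
    best = None
    for step in range(70, 121):                   # N from 1e7 to ~1.3e12
        n = 10 ** (step / 10)
        d = compute_flops / (6 * n)               # tokens affordable at this N
        seen_unique = min(d, unique_tokens)       # fresh tokens actually used
        reps = max(d / unique_tokens - 1.0, 0.0)  # epochs beyond the first
        loss = expected_loss(n, seen_unique, reps)
        if best is None or loss < best[0]:
            best = (loss, n, d)
    return best

loss, n, d = best_allocation(compute_flops=1e22, unique_tokens=100e9)
print(f"loss={loss:.3f}  params={n:.3e}  tokens={d:.3e}")
```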

Maintenance & Community

  • The primary author is Niklas Muennighoff.
  • The project is associated with Hugging Face; additional author affiliations are listed in the paper.
  • No specific community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

  • All code and models are licensed under Apache 2.0.
  • Filtered datasets inherit the license of their source datasets.
  • Apache 2.0 is permissive for commercial use and closed-source linking.

Limitations & Caveats

  • The README details complex data preprocessing and training setups that require familiarity with Megatron-DeepSpeed and large-scale training infrastructure.
  • Some model checkpoints are split into multiple files due to size limits and must be concatenated before use (see the sketch after this list).
  • Specific evaluation tasks require cloning different, potentially incompatible, branches of the lm-evaluation-harness.
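Reassembling a split checkpoint is plain byte concatenation of the parts in order. A minimal sketch, assuming the parts share a common prefix and sort lexicographically; the naming pattern is hypothetical, so adjust it to the actual files shipped with the model:

```python
import glob
import shutil

def join_checkpoint(part_prefix, output_path):
    """Concatenate split checkpoint parts (hypothetical naming, e.g.
    pytorch_model.bin.part-00, .part-01, ...) into a single file."""
    parts = sorted(glob.glob(part_prefix + "*"))
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

join_checkpoint("pytorch_model.bin.part-", "pytorch_model.bin")
```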
Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Alex Cheema (cofounder of EXO Labs), and 1 more.

recurrent-pretraining by seal-rg

Pretraining code for depth-recurrent language model research

Top 0.1% on sourcepulse · 806 stars · created 5 months ago · updated 2 weeks ago