datablations by huggingface

Research paper on scaling data-constrained language models

Created 2 years ago
342 stars

Top 80.8% on SourcePulse

View on GitHub
Project Summary

This repository provides the code, models, and datasets for the paper "Scaling Data-Constrained Language Models," which investigates optimizing language model training under data constraints. It offers tools and pre-trained models for researchers and practitioners aiming to improve efficiency and performance by managing data repetition and compute budgets.

How It Works

The project explores data-constrained scaling laws for language models, focusing on the diminishing returns of data repetition and excess parameters. It introduces a compute-optimal scaling law that accounts for these factors and provides methods for mitigating data scarcity through techniques like code augmentation, perplexity filtering, and deduplication. The core approach involves extensive experimentation with varying data repetition levels and compute budgets, up to 900 billion tokens and 9 billion parameters.
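
The compute-optimal law described above has a simple functional form: expected loss is L = E + A / N'^alpha + B / D'^beta, where the effective parameters N' and effective tokens D' discount excess parameters and repeated epochs exponentially. The sketch below shows that form only; the constants are placeholders, and the fitted values published in the repository's README should be used instead.

    import numpy as np

    def data_constrained_loss(N, D, U_D, U_N,
                              E, A, B, alpha, beta, r_n_star, r_d_star):
        """Expected loss under a data-constrained scaling law of the form
        L = E + A / N_eff**alpha + B / D_eff**beta, where repeated tokens and
        excess parameters decay exponentially in value."""
        R_D = max(D / U_D - 1.0, 0.0)  # epochs of repetition beyond one pass over U_D unique tokens
        R_N = max(N / U_N - 1.0, 0.0)  # excess parameters relative to the base size U_N
        D_eff = U_D + U_D * r_d_star * (1.0 - np.exp(-R_D / r_d_star))
        N_eff = U_N + U_N * r_n_star * (1.0 - np.exp(-R_N / r_n_star))
        return E + A / N_eff**alpha + B / D_eff**beta

Here U_D is the number of unique tokens, U_N the base (compute-optimal) parameter count for U_D unique tokens, and r_n_star, r_d_star the fitted decay constants; with the repository's fitted values this should reproduce the paper's loss predictions.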

Quick Start & Requirements

  • Data Preprocessing: Requires cloning the Megatron-DeepSpeed repository and using its preprocess_data_many_cores.py script. Tokenization uses the GPT-2 (gpt2) tokenizer.
  • Evaluation: Requires installing lm-evaluation-harness (pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git). Specific evaluation tasks may require cloning different branches of the harness.
  • Training: Supports AMD GPUs via a fork of Megatron-DeepSpeed (https://github.com/TurkuNLP/Megatron-DeepSpeed) or NVIDIA GPUs via the BigScience repository (https://github.com/bigscience-workshop/Megatron-DeepSpeed). Setup instructions are detailed in training/megdssetup.md.
  • Dependencies: Python, datasets, numpy, transformers. Training additionally requires CUDA (NVIDIA) or ROCm (AMD).
  • Resources: Preprocessed datasets and pre-trained models are available on the Hugging Face Hub (see the loading sketch after this list). Training large models requires significant compute resources.
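
As a rough illustration of pulling these artifacts, the snippet below loads a checkpoint and a dataset with transformers and datasets. The repository IDs are hypothetical placeholders; substitute the actual names listed under the datablations organization on the Hub.

    # Hypothetical repo IDs for illustration only; browse the "datablations"
    # organization on the Hugging Face Hub for the real model and dataset names.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("datablations/example-model")  # placeholder ID
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # the models use the GPT-2 tokenizer

    data = load_dataset("datablations/example-dataset", split="train")  # placeholder ID
    print(model.num_parameters(), len(data))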

Highlighted Details

  • Provides 400 pre-trained models and datasets covering various configurations of parameters, tokens, and data augmentation strategies.
  • Includes a parametric scaling law formula and Python code for calculating expected loss and optimal model/data allocation.
  • Offers detailed scripts and instructions for data preprocessing, training, and downstream evaluation (accuracy, generative, bAbI).
  • Contains extensive plotting and table generation scripts to reproduce figures and tables from the paper.

Maintenance & Community

  • The primary author is Niklas Muennighoff.
  • The project is associated with Hugging Face; other contributing institutions are listed in the paper's author affiliations.
  • No specific community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

  • All code and models are licensed under Apache 2.0.
  • Filtered datasets inherit the license of their source datasets.
  • Apache 2.0 is permissive for commercial use and closed-source linking.

Limitations & Caveats

  • The README details complex data preprocessing and training setups that require familiarity with Megatron-DeepSpeed and large-scale training infrastructure.
  • Some model checkpoints are split into multiple files due to size limitations and must be concatenated before use (see the sketch after this list).
  • Specific evaluation tasks require cloning different, potentially incompatible, branches of the lm-evaluation-harness.
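
For the split checkpoints mentioned above, reassembly is plain byte concatenation of the parts in order. A minimal sketch, assuming the parts share a sortable naming scheme (the file names are hypothetical; check the model card for the actual part names):

    import glob
    import shutil

    # Hypothetical part names; confirm the actual naming on the model card.
    parts = sorted(glob.glob("pytorch_model.bin.part*"))
    with open("pytorch_model.bin", "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream each part without loading it fully into memory
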
Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab

  • MoE model for research
  • 462 stars, top 0.2%
  • Created 4 months ago, updated 4 weeks ago

Starred by Wing Lian (Founder of Axolotl AI), Chip Huyen (Author of "AI Engineering" and "Designing Machine Learning Systems"), and 2 more.

recurrent-pretraining by seal-rg

  • Pretraining code for depth-recurrent language model research
  • 827 stars, top 0%
  • Created 7 months ago, updated 1 week ago