Framework for LLM dataset creation, training, and evaluation
DataComp-LM (DCLM) is a framework for building and training Large Language Models (LLMs) with diverse datasets, targeting researchers and practitioners. It provides a standardized corpus, pretraining recipes, and evaluation suite to facilitate experimentation with dataset construction strategies, aiming to improve model performance and reduce training costs.
How It Works
DCLM follows a five-step workflow: raw source selection, data processing, tokenization/shuffling, model training, and evaluation. It leverages Ray for distributed data processing and offers both Rust-based and Ray-based tokenization/shuffling. The framework uses a reference JSON system to track datasets, models, and evaluations, enabling reproducible experiments.
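To make the reference-JSON idea concrete, here is a minimal sketch of registering a processed dataset; the `exp_data/` path and every field name are illustrative assumptions, not DCLM's actual schema.

```bash
# Hypothetical sketch of the reference-JSON bookkeeping. The path and all
# field names below are assumptions for illustration, not DCLM's real schema.
mkdir -p exp_data/datasets
cat > exp_data/datasets/my_filtered_subset.json <<'EOF'
{
  "name": "my_filtered_subset",
  "sources": ["common_crawl"],
  "processing_steps": ["dedup", "quality_filter"],
  "tokenizer": "gpt-neox-20b",
  "tokenized_path": "s3://my-bucket/tokens/my_filtered_subset/"
}
EOF
```

Later training and evaluation runs can then point back at a file like this, which is what makes experiments reproducible and directly comparable.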
Quick Start & Requirements
```bash
git clone https://github.com/mlfoundations/DCLM.git
cd DCLM
pip install -r requirements.txt
```

Ensure `cmake`, `build-essential`, and `g++` are installed, then run:

```bash
python setup.py install
```

Highlighted Details
Training is launched with `torchrun`, and evaluation runs via `tools/eval_expdb.py` or `eval/eval_openlm_ckpt.py`.
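As a hedged sketch of how those entry points might be invoked (the training script path and all script arguments are hypothetical placeholders; consult each script's `--help` for the real flags):

```bash
# Illustrative invocations only, assuming a single 8-GPU node. torchrun's
# --nproc_per_node flag is real; training/train.py and the arguments to the
# repo scripts are assumed placeholders, not confirmed interfaces.
torchrun --nproc_per_node 8 training/train.py --config configs/my_run.yaml

# Evaluate a trained checkpoint; --checkpoint is an assumed flag name.
python eval/eval_openlm_ckpt.py --checkpoint checkpoints/my_run/final.pt
```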
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats