mlfoundations/DCLM: Framework for LLM dataset creation, training, and evaluation
Top 29.2% on SourcePulse
DataComp-LM (DCLM) is a framework for building and training Large Language Models (LLMs) with diverse datasets, targeting researchers and practitioners. It provides a standardized corpus, pretraining recipes, and evaluation suite to facilitate experimentation with dataset construction strategies, aiming to improve model performance and reduce training costs.
How It Works
DCLM follows a five-step workflow: raw source selection, data processing, tokenization/shuffling, model training, and evaluation. It leverages Ray for distributed data processing and offers both Rust-based and Ray-based tokenization/shuffling. The framework uses a reference JSON system to track datasets, models, and evaluations, enabling reproducible experiments.
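The reference-JSON tracking described above can be sketched as follows. This is a minimal illustration of the idea, not DCLM's actual schema; the record fields and helper name are assumptions:

```python
import json

def make_reference(dataset_name, tokenizer, model_ckpt, eval_results):
    """Bundle experiment artifacts into one JSON-serializable record.

    Hypothetical schema: one record links the dataset, the trained
    model checkpoint, and the evaluation results so an experiment
    can be reproduced or compared later.
    """
    return {
        "dataset": {"name": dataset_name, "tokenizer": tokenizer},
        "model": {"checkpoint": model_ckpt},
        "evaluation": eval_results,
    }

# Illustrative values only -- names and scores are placeholders.
ref = make_reference(
    dataset_name="dclm-baseline-sample",
    tokenizer="gpt-neox-20b",
    model_ckpt="checkpoints/model_final.pt",
    eval_results={"hellaswag": 0.55},
)

# Serialize the record so it can be stored alongside the experiment.
serialized = json.dumps(ref, indent=2)
print(serialized)
```

Keeping dataset, model, and evaluation pointers in one serialized record is what makes an experiment reproducible: anyone with the record can retrace which corpus and recipe produced which scores.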
Quick Start & Requirements
Ensure cmake, build-essential, and g++ are installed, then clone the repository and install:

```shell
git clone https://github.com/mlfoundations/DCLM.git
cd DCLM
pip install -r requirements.txt
python setup.py install
```

Highlighted Details
Training is launched with torchrun, and evaluation runs via tools/eval_expdb.py or eval/eval_openlm_ckpt.py.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats