dclm by mlfoundations

Framework for LLM dataset creation, training, and evaluation

created 1 year ago · 1,342 stars · Top 30.5% on sourcepulse

Project Summary

DataComp-LM (DCLM) is a framework for building and training Large Language Models (LLMs) with diverse datasets, aimed at researchers and practitioners. It provides a standardized corpus, pretraining recipes, and an evaluation suite to support controlled experiments with dataset construction strategies, with the goal of improving model performance and reducing training cost.

How It Works

DCLM follows a five-step workflow: raw source selection, data processing, tokenization/shuffling, model training, and evaluation. It leverages Ray for distributed data processing and offers both Rust-based and Ray-based tokenization/shuffling. The framework uses a reference JSON system to track datasets, models, and evaluations, enabling reproducible experiments.
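
To make the data processing step concrete, here is a minimal sketch of fanning a filter out over JSONL shards with Ray. The shard layout, the filter_shard task, and the length-based quality rule are illustrative assumptions, not DCLM's actual pipeline API:

    import json
    import os

    import ray

    ray.init()  # start a local Ray instance, or connect to an existing cluster

    @ray.remote
    def filter_shard(in_path: str, out_path: str) -> int:
        # Apply a toy length filter to one JSONL shard; return the number kept.
        kept = 0
        with open(in_path) as src, open(out_path, "w") as dst:
            for line in src:
                doc = json.loads(line)
                if len(doc.get("text", "")) >= 200:  # hypothetical quality rule
                    dst.write(line)
                    kept += 1
        return kept

    # Hypothetical shard layout; Ray runs one task per shard in parallel.
    os.makedirs("filtered", exist_ok=True)
    shards = [f"raw/shard_{i:05d}.jsonl" for i in range(8)]
    futures = [filter_shard.remote(p, p.replace("raw/", "filtered/")) for p in shards]
    print(sum(ray.get(futures)), "documents kept")

In practice DCLM drives Ray through its own processing scripts rather than hand-written tasks like this.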

Quick Start & Requirements

  • Install: git clone https://github.com/mlfoundations/DCLM.git, cd DCLM, pip install -r requirements.txt. Ensure cmake, build-essential, and g++ are installed.
  • Prerequisites: Python 3.10 is recommended; Ray for distributed processing; AWS credentials for data access and, optionally, for a cloud compute backend.
  • Setup: run python setup.py install to install the package and download the models and data the pipeline depends on.
  • Docs: see the Workflow Overview in the repository README.
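
To illustrate the reference JSON system mentioned under How It Works, the sketch below registers a hypothetical processed dataset. The schema, field names, and exp_data/ path are assumptions for illustration; consult the repository's actual reference files for the real format:

    import json
    from pathlib import Path

    # Hypothetical dataset reference; DCLM's real schema has more fields.
    dataset_ref = {
        "name": "cc_example_filtered",
        "tokenized": False,
        "dataset_url": "s3://my-bucket/filtered/",  # assumed output location
        "sources": ["common_crawl"],
    }

    # Assumed layout: reference JSONs live under an exp_data/ tree.
    ref_dir = Path("exp_data/datasets")
    ref_dir.mkdir(parents=True, exist_ok=True)
    (ref_dir / "cc_example_filtered.json").write_text(json.dumps(dataset_ref, indent=2))

Downstream steps (tokenize/shuffle, train, evaluate) can then locate the dataset through this reference instead of hard-coded paths, which is what makes experiments reproducible.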

Highlighted Details

  • Offers a 240T-token standardized corpus (DCLM-Pool) and supports model scales from 412M to 7B parameters.
  • Includes a leaderboard showcasing community submissions and performance comparisons.
  • Provides Rust-based deduplication tools for efficient inter-document fuzzy matching (a minimal sketch of the idea follows this list).
  • Supports training with torchrun and evaluation via tools/eval_expdb.py or eval/eval_openlm_ckpt.py.
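
The fuzzy matching behind deduplication is typically MinHash-based: near-duplicate documents share most of their word shingles, so their MinHash signatures agree in most slots. The Python sketch below illustrates only the idea; the repository's actual dedup lives in its Rust tools, and the shingle size, hash count, and any match threshold are assumptions here:

    import hashlib

    def shingles(text: str, k: int = 5) -> set[str]:
        # The set of k-word shingles (overlapping word windows) of a document.
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

    def minhash(text: str, num_hashes: int = 64) -> list[int]:
        # For each seeded hash function, keep the minimum hash over all shingles.
        shingle_set = shingles(text)
        return [
            min(
                int.from_bytes(
                    hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                    "big",
                )
                for s in shingle_set
            )
            for seed in range(num_hashes)
        ]

    def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
        # The fraction of agreeing slots estimates the Jaccard similarity
        # of the two shingle sets.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    doc_a = " ".join(f"sentence {i} about data curation" for i in range(100))
    doc_b = doc_a.replace(" 50 ", " fifty ")  # a near-duplicate of doc_a
    print(estimated_jaccard(minhash(doc_a), minhash(doc_b)))  # close to 1.0

Pairs whose estimated similarity exceeds a chosen threshold (say 0.8) would be treated as duplicates; signatures make this cheap because each document is hashed once and pairs are compared in constant time.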

Maintenance & Community

  • The project is actively maintained by the mlfoundations team.
  • Contributions are welcomed via pull requests and issue reporting.
  • Citation details are provided for research use.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Rust-based deduplication tools are not directly integrable with Ray-based pipelines.
  • Some example commands require manual JSON updates for dataset paths.
  • AWS credentials are required for accessing data from Common Crawl.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 55 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alexander Wettig (Author of SWE-bench, SWE-agent), and 2 more.

Explore Similar Projects

data-juicer by modelscope

Top 0.7% on sourcepulse · 5k stars
Data-Juicer: Data processing system for foundation models
created 2 years ago · updated 1 day ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

towhee by towhee-io

Top 0.2% on sourcepulse · 3k stars
Framework for neural data processing pipelines
created 4 years ago · updated 9 months ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

Top 0.1% on sourcepulse · 5k stars
LLM research codebase for training and inference
created 9 months ago · updated 2 weeks ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

argilla by argilla-io

Top 0.4% on sourcepulse · 5k stars
Collaboration tool for building high-quality AI datasets
created 4 years ago · updated 5 days ago