dclm by mlfoundations

Framework for LLM dataset creation, training, and evaluation

created 1 year ago · 1,342 stars · Top 30.5% on sourcepulse

Project Summary

DataComp-LM (DCLM) is a framework for building and training Large Language Models (LLMs) with diverse datasets, aimed at researchers and practitioners. It provides a standardized corpus, pretraining recipes, and an evaluation suite to support controlled experiments with dataset construction strategies, with the goal of improving model performance and reducing training cost.

How It Works

DCLM follows a five-step workflow: raw source selection, data processing, tokenization/shuffling, model training, and evaluation. It leverages Ray for distributed data processing and offers both Rust-based and Ray-based tokenization/shuffling. The framework uses a reference JSON system to track datasets, models, and evaluations, enabling reproducible experiments.
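
To make the data processing step concrete, here is a minimal sketch of fanning a filter out over JSONL shards with Ray. The shard layout, the filter_shard task, and the length-based quality rule are illustrative assumptions, not DCLM's actual pipeline API:

    import json
    import os

    import ray

    ray.init()  # start a local Ray instance, or connect to an existing cluster

    @ray.remote
    def filter_shard(in_path: str, out_path: str) -> int:
        # Apply a toy length filter to one JSONL shard; return the number kept.
        kept = 0
        with open(in_path) as src, open(out_path, "w") as dst:
            for line in src:
                doc = json.loads(line)
                if len(doc.get("text", "")) >= 200:  # hypothetical quality rule
                    dst.write(line)
                    kept += 1
        return kept

    # Hypothetical shard layout; Ray runs one task per shard in parallel.
    os.makedirs("filtered", exist_ok=True)
    shards = [f"raw/shard_{i:05d}.jsonl" for i in range(8)]
    futures = [filter_shard.remote(p, p.replace("raw/", "filtered/")) for p in shards]
    print(sum(ray.get(futures)), "documents kept")

In practice DCLM drives Ray through its own processing scripts rather than hand-written tasks like this.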

Quick Start & Requirements

  • Install: git clone https://github.com/mlfoundations/DCLM.git, cd DCLM, pip install -r requirements.txt. Ensure cmake, build-essential, and g++ are installed.
  • Prerequisites: Python 3.10 is recommended; Ray for distributed processing; AWS credentials for data access and, optionally, for a cloud compute backend.
  • Setup: run python setup.py install to install the package and download the models and data the pipeline depends on.
  • Docs: see the Workflow Overview in the repository README.
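
To illustrate the reference JSON system mentioned under How It Works, the sketch below registers a hypothetical processed dataset. The schema, field names, and exp_data/ path are assumptions for illustration; consult the repository's actual reference files for the real format:

    import json
    from pathlib import Path

    # Hypothetical dataset reference; DCLM's real schema has more fields.
    dataset_ref = {
        "name": "cc_example_filtered",
        "tokenized": False,
        "dataset_url": "s3://my-bucket/filtered/",  # assumed output location
        "sources": ["common_crawl"],
    }

    # Assumed layout: reference JSONs live under an exp_data/ tree.
    ref_dir = Path("exp_data/datasets")
    ref_dir.mkdir(parents=True, exist_ok=True)
    (ref_dir / "cc_example_filtered.json").write_text(json.dumps(dataset_ref, indent=2))

Downstream steps (tokenize/shuffle, train, evaluate) can then locate the dataset through this reference instead of hard-coded paths, which is what makes experiments reproducible.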

Highlighted Details

  • Offers a 240T-token standardized corpus (DCLM-Pool) and supports model scales from 412M to 7B parameters.
  • Includes a leaderboard showcasing community submissions and performance comparisons.
  • Provides Rust-based deduplication tools for efficient inter-document fuzzy matching (a minimal sketch of the idea follows this list).
  • Supports training with torchrun and evaluation via tools/eval_expdb.py or eval/eval_openlm_ckpt.py.
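
The fuzzy matching behind deduplication is typically MinHash-based: near-duplicate documents share most of their word shingles, so their MinHash signatures agree in most slots. The Python sketch below illustrates only the idea; the repository's actual dedup lives in its Rust tools, and the shingle size, hash count, and any match threshold are assumptions here:

    import hashlib

    def shingles(text: str, k: int = 5) -> set[str]:
        # The set of k-word shingles (overlapping word windows) of a document.
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

    def minhash(text: str, num_hashes: int = 64) -> list[int]:
        # For each seeded hash function, keep the minimum hash over all shingles.
        shingle_set = shingles(text)
        return [
            min(
                int.from_bytes(
                    hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                    "big",
                )
                for s in shingle_set
            )
            for seed in range(num_hashes)
        ]

    def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
        # The fraction of agreeing slots estimates the Jaccard similarity
        # of the two shingle sets.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    doc_a = " ".join(f"sentence {i} about data curation" for i in range(100))
    doc_b = doc_a.replace(" 50 ", " fifty ")  # a near-duplicate of doc_a
    print(estimated_jaccard(minhash(doc_a), minhash(doc_b)))  # close to 1.0

Pairs whose estimated similarity exceeds a chosen threshold (say 0.8) would be treated as duplicates; signatures make this cheap because each document is hashed once and pairs are compared in constant time.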

Maintenance & Community

  • The project is actively maintained by the mlfoundations team.
  • Contributions are welcomed via pull requests and issue reporting.
  • Citation details are provided for research use.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Rust-based deduplication tools are not directly integrable with Ray-based pipelines.
  • Some example commands require manual JSON updates for dataset paths.
  • AWS credentials are required for accessing data from Common Crawl.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 1
  • Star history: 55 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Alexander Wettig (Author of SWE-bench, SWE-agent), and 2 more.

Explore Similar Projects

data-juicer by modelscope

Top 0.7% on sourcepulse · 5k stars
Data-Juicer: Data processing system for foundation models
created 2 years ago · updated 1 day ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Jeff Hammerbacher (Cofounder of Cloudera).

towhee by towhee-io

Top 0.2% on sourcepulse · 3k stars
Framework for neural data processing pipelines
created 4 years ago · updated 9 months ago
Starred by Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (Founder of Agentic).

lingua by facebookresearch

Top 0.1% on sourcepulse · 5k stars
LLM research codebase for training and inference
created 9 months ago · updated 2 weeks ago
Starred by Omar Sanseviero (DevRel at Google DeepMind), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 4 more.

argilla by argilla-io

Top 0.4% on sourcepulse · 5k stars
Collaboration tool for building high-quality AI datasets
created 4 years ago · updated 5 days ago