dclm by mlfoundations

Framework for LLM dataset creation, training, and evaluation

Created 1 year ago
1,368 stars

Top 29.4% on SourcePulse

Project Summary

DataComp-LM (DCLM) is a framework for building and training Large Language Models (LLMs) with diverse datasets, targeting researchers and practitioners. It provides a standardized corpus, pretraining recipes, and evaluation suite to facilitate experimentation with dataset construction strategies, aiming to improve model performance and reduce training costs.

How It Works

DCLM follows a five-step workflow: raw source selection, data processing, tokenization/shuffling, model training, and evaluation. It leverages Ray for distributed data processing and offers both Rust-based and Ray-based tokenization/shuffling. The framework uses a reference JSON system to track datasets, models, and evaluations, enabling reproducible experiments.
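The reference-JSON idea can be pictured with a small, hypothetical record that links a processed dataset to the model trained on it and that model's evaluation. Field names and values here are illustrative, not DCLM's actual schema:

```python
import json

# Hypothetical experiment record; the scale name "411m_1x" follows the
# 411M-parameter scale mentioned above, and all fields are illustrative.
reference = {
    "dataset": {
        "name": "my_filtered_cc_subset",
        "tokenizer": "EleutherAI/gpt-neox-20b",
        "num_tokens": 28_000_000_000,
    },
    "model": {
        "scale": "411m_1x",
        "trained_on": "my_filtered_cc_subset",
    },
    "evaluation": {
        "checkpoint": "checkpoints/411m_1x/final.pt",
        "suite": "core",
    },
}

# Round-trip through JSON, as any such tracking system must when persisting
# experiment metadata to disk.
restored = json.loads(json.dumps(reference))
print(restored["model"]["trained_on"])
```

Keeping dataset, model, and evaluation identifiers in one serializable record is what makes a run reproducible: every downstream artifact can be traced back to the exact data it came from.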

Quick Start & Requirements

  • Install: git clone https://github.com/mlfoundations/DCLM.git, cd DCLM, pip install -r requirements.txt. Ensure cmake, build-essential, and g++ are installed.
  • Prerequisites: Python 3.10 recommended. AWS credentials for data access and potential compute backend. Ray for distributed processing.
  • Setup: Run python setup.py install to download the required models and data.
  • Docs: Workflow Overview
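Before running the install commands above, a small helper can confirm the build prerequisites are present. This is a convenience sketch, not part of DCLM itself:

```python
import shutil
import sys

def check_prereqs(tools=("cmake", "g++")):
    """Return a list of missing prerequisites for DCLM's native dependencies.

    Checks for the build tools named in the Quick Start section and for the
    recommended Python version (3.10+).
    """
    missing = [t for t in tools if shutil.which(t) is None]
    if sys.version_info < (3, 10):
        missing.append("python>=3.10")
    return missing

missing = check_prereqs()
print(missing or "all prerequisites found")
```

Running this before pip install surfaces a missing compiler up front instead of partway through a long dependency build.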

Highlighted Details

  • Offers a 240T token corpus (DCLM-Pool) and supports model scales from 411M to 7B parameters.
  • Includes a leaderboard showcasing community submissions and performance comparisons.
  • Provides Rust-based deduplication tools for efficient inter-document fuzzy matching.
  • Supports training with torchrun and evaluation via tools/eval_expdb.py or eval/eval_openlm_ckpt.py.
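The inter-document fuzzy matching performed by the Rust deduplication tools can be sketched in miniature with shingle-based Jaccard similarity. This is a toy illustration of the concept only; the repo's actual tools use a different, far more scalable algorithm and their own thresholds:

```python
def shingles(text, n=3):
    """Set of word n-grams ("shingles") representing a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two shingle sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def fuzzy_duplicates(docs, threshold=0.4):
    """Return index pairs of documents whose shingle overlap meets threshold."""
    sets = [shingles(d) for d in docs]
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if jaccard(sets[i], sets[j]) >= threshold
    ]

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over a lazy dog",
    "completely different text about language models",
]
print(fuzzy_duplicates(docs))  # the first two documents are near-duplicates
```

The all-pairs comparison above is quadratic in the number of documents, which is exactly why production deduplication at corpus scale relies on sketching techniques implemented in a systems language like Rust rather than exact pairwise Jaccard.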

Maintenance & Community

  • The project is actively maintained by the mlfoundations team.
  • Contributions are welcomed via pull requests and issue reporting.
  • Citation details are provided for research use.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • Rust-based deduplication tools are not directly integrable with Ray-based pipelines.
  • Some example commands require manual JSON updates for dataset paths.
  • AWS credentials are required for accessing data from Common Crawl.
Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 2
  • Issues (30d): 0
  • Star History: 20 stars in the last 30 days

Explore Similar Projects

Starred by Lewis Tunstall (Research Engineer at Hugging Face), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 11 more.

datatrove by huggingface

Top 0.9% on SourcePulse
3k stars
Data processing library for large-scale text data
Created 2 years ago
Updated 2 days ago
Starred by Théophile Gervet (Cofounder of Genesis AI), Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), and 6 more.

lingua by facebookresearch

Top 0.1% on SourcePulse
5k stars
LLM research codebase for training and inference
Created 11 months ago
Updated 2 months ago