dolma by allenai

Toolkit for curating datasets for language model pre-training

Created 2 years ago

1,412 stars

Top 28.3% on SourcePulse

6 Experts Love This Project

Chip Huyen

Author of "AI Engineering", "Designing Machine Learning Systems"

Elvis Saravia

Founder of DAIR.AI

Jeff Hammerbacher

Cofounder of Cloudera

Luca Soldaini

Research Scientist at Ai2

and 2 more!

Project Summary

This repository provides the Dolma Toolkit, a high-performance, portable, and extensible toolkit for curating large datasets for language model pre-training. It enables efficient processing of billions of documents with built-in taggers and fast deduplication, targeting researchers and engineers building large language models.

How It Works

The toolkit processes documents concurrently through built-in parallelism, so it runs on a single machine, a cluster, or a cloud environment. It ships ready-to-use taggers for common dataset curation tasks (e.g., Gopher, C4, OpenWebText) and uses a Bloom filter implemented in Rust for fast document deduplication. It can be extended with custom taggers and works with AWS S3-compatible storage.
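To make the deduplication idea concrete, here is a minimal pure-Python sketch of exact-match deduplication with a Bloom filter. It is not Dolma's Rust implementation: the `BloomFilter` class, the `deduplicate` helper, the `text` field name, and the sizing defaults are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Insert item; return True if it was definitely not seen before."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def deduplicate(docs, expected_items=1000):
    """Keep the first occurrence of each document text (hypothetical helper)."""
    bloom = BloomFilter(expected_items)
    return [doc for doc in docs if bloom.add_if_new(doc["text"])]
```

The trade-off is memory for certainty: the filter stores only bits, never the documents themselves, so it scales to very large corpora, but a small fraction of unique documents may be dropped as false positives.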

Quick Start & Requirements

Install via pip: pip install dolma
Documentation: https://allenai.github.io/dolma/

Highlighted Details

Used to curate Dolma, an open dataset of 3 trillion tokens.
Includes built-in taggers for common dataset curation tasks (e.g., Gopher, C4, OpenWebText).
Features fast document deduplication using a Rust Bloom filter.
Extensible for custom taggers and cloud storage.
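As a rough illustration of what a tagger does, the sketch below marks character spans in a document and lets a later filtering step decide what to keep. It does not use Dolma's actual tagger API: `Span`, `tag_short_lines`, `keep_document`, and the thresholds are hypothetical names and values, and the heuristic is only loosely in the spirit of C4-style line filtering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends (exclusive)
    label: str   # name of the attribute being tagged
    score: float


def tag_short_lines(text: str, min_words: int = 3) -> List[Span]:
    """Emit one span per line that has fewer than `min_words` words."""
    spans = []
    offset = 0
    for line in text.split("\n"):
        if len(line.split()) < min_words:
            spans.append(Span(offset, offset + len(line), "short_line", 1.0))
        offset += len(line) + 1  # +1 for the newline removed by split
    return spans


def keep_document(text: str, max_short_fraction: float = 0.5) -> bool:
    """Keep a document unless too many of its characters sit in short lines."""
    flagged = sum(s.end - s.start for s in tag_short_lines(text))
    return flagged / max(len(text), 1) <= max_short_fraction
```

Separating tagging from filtering means the same spans can be reused with different thresholds without re-reading the documents.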

Maintenance & Community

Developed by the Allen Institute for AI (AI2).
Citation details provided for the Dolma dataset and toolkit.

Licensing & Compatibility

Dolma Dataset is licensed under ODC-BY.
The repository's license is not explicitly stated in the README, but it is associated with AI2's open-source efforts.

Limitations & Caveats

The README states a license only for the Dolma dataset, not for the toolkit itself. Anyone planning commercial use should confirm the toolkit's license in the repository first.

Health Check

Last Commit

3 months ago

Responsiveness

1 week

Star History

18 stars in the last 30 days