dolma by allenai

Toolkit for curating datasets for language model pre-training

Created 2 years ago
1,392 stars

Top 28.9% on SourcePulse

Project Summary

This repository provides the Dolma Toolkit, a high-performance, portable, and extensible toolkit for curating large datasets for language model pre-training. It enables efficient processing of billions of documents with built-in taggers and fast deduplication, targeting researchers and engineers building large language models.

How It Works

The toolkit processes documents concurrently using built-in parallelism, so the same pipeline can run on a single machine, a cluster, or a cloud environment. It ships with ready-to-use taggers for common dataset curation tasks (e.g., Gopher, C4, OpenWebText) and uses a Bloom filter implemented in Rust for fast document deduplication. It can be extended with custom taggers and integrates with AWS S3-compatible storage.
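
To make the deduplication idea concrete, here is a minimal Bloom filter sketch in pure Python. It is not Dolma's Rust implementation and does not use the Dolma API; the names `BloomFilter` and `deduplicate`, the bit-array size, and the choice of SHA-256 are all assumptions made for illustration. The point it shows is why a Bloom filter suits this job: membership is tracked in a fixed-size bit array, so memory stays constant no matter how many documents pass through, at the cost of occasionally dropping a unique document as a false positive.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Illustrative fixed-size Bloom filter (no false negatives)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive all bit positions from two 64-bit halves of one SHA-256
        # digest (double hashing), instead of computing k separate hashes.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was possibly already present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def deduplicate(docs: Iterable[str], bloom: BloomFilter) -> Iterator[str]:
    """Yield each document whose exact text the filter has not seen yet."""
    for text in docs:
        if not bloom.add(text):
            yield text
```

Calling `list(deduplicate(texts, BloomFilter()))` keeps the first occurrence of each text and drops later exact repeats. A real pipeline would size the bit array from the expected document count and the false-positive rate it can tolerate.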

Quick Start & Requirements

Highlighted Details

  • Scales to very large corpora; it was used to curate the 3-trillion-token Dolma dataset.
  • Includes built-in taggers for common dataset curation tasks (e.g., Gopher, C4, OpenWebText).
  • Features fast document deduplication using a Rust Bloom filter.
  • Extensible for custom taggers and cloud storage.
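
As a rough picture of what a custom tagger might look like, the sketch below assumes a tagger is a function that takes a document's text and returns labeled, scored character spans that a later filtering step can act on. This is a standalone illustration, not the Dolma API: `Span`, `TAGGERS`, `register`, and `short_paragraph` are hypothetical names, and the toolkit's own base classes and registration mechanism may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Span:
    """A labeled, scored character range within one document (hypothetical)."""
    start: int
    end: int
    label: str
    score: float


# Hypothetical registry mapping a tagger name to its function.
TAGGERS: Dict[str, Callable[[str], List[Span]]] = {}


def register(name: str):
    """Decorator that stores a tagger function under the given name."""
    def wrap(fn):
        TAGGERS[name] = fn
        return fn
    return wrap


@register("short_paragraph")
def short_paragraph(text: str, min_words: int = 5) -> List[Span]:
    """Flag every non-empty paragraph with fewer than min_words words."""
    spans: List[Span] = []
    offset = 0
    for para in text.split("\n"):
        if para.strip() and len(para.split()) < min_words:
            spans.append(Span(offset, offset + len(para), "short", 1.0))
        offset += len(para) + 1  # +1 for the newline removed by split
    return spans
```

Keeping taggers as small, named, independent functions means several of them can be run over the same document in parallel, with the decision about what to drop left to a separate step that reads their spans.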

Maintenance & Community

  • Developed by the Allen Institute for AI (AI2).
  • Citation details provided for the Dolma dataset and toolkit.

Licensing & Compatibility

  • Dolma Dataset is licensed under ODC-BY.
  • The README does not explicitly state a license for the toolkit code itself.

Limitations & Caveats

The README does not specify the license for the Dolma Toolkit itself, only for the associated dataset. Further clarification on the toolkit's licensing would be beneficial for commercial use considerations.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 38 stars in the last 30 days
