Toolkit for curating datasets for language model pre-training
This repository provides the Dolma Toolkit, a high-performance, portable, and extensible toolkit for curating large datasets for language model pre-training. It enables efficient processing of billions of documents with built-in taggers and fast deduplication, targeting researchers and engineers building large language models.
How It Works
The toolkit processes documents concurrently through built-in parallelism, so the same workflow runs on a single machine, a cluster, or a cloud environment. It ships with ready-to-use taggers of the kind commonly used to curate datasets such as Gopher, C4, and OpenWebText, and it deduplicates documents quickly using a Bloom filter implemented in Rust. It is also extensible: users can add custom taggers, and it can read from and write to AWS S3-compatible storage.
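To make the deduplication idea concrete, here is a minimal pure-Python sketch of Bloom-filter-based exact deduplication. It is an illustration only: Dolma's own filter is written in Rust, and the names `BloomFilter` and `deduplicate`, the SHA-256 double-hashing scheme, and the default false-positive rate are all assumptions made for this example, not part of the Dolma API.

```python
import hashlib
import math


class BloomFilter:
    """Hypothetical minimal Bloom filter (not Dolma's implementation).

    It can report false positives but never false negatives.
    """

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            8,
            math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2),
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, text: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, text: str) -> bool:
        """Insert text and return True if it was (probably) already present."""
        already_present = True
        for pos in self._positions(text):
            byte_index, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte_index] & mask:
                already_present = False
                self.bits[byte_index] |= mask
        return already_present


def deduplicate(documents, expected_items: int = 1000):
    """Keep the first occurrence of each document text, in order."""
    bloom = BloomFilter(expected_items)
    return [doc for doc in documents if not bloom.add(doc)]


if __name__ == "__main__":
    docs = ["a b", "c d", "a b", "e f", "c d"]
    print(deduplicate(docs))
```

The filter stores only a fixed-size bit array rather than the documents themselves, which is why this approach scales to very large corpora. The trade-off is a small, tunable chance of dropping a unique document as a false positive.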
Quick Start & Requirements
pip install dolma
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The README does not specify the license for the Dolma Toolkit itself, only for the associated dataset. Clarification of the toolkit's license would help anyone evaluating it for commercial use.