dolma by allenai

Toolkit for curating datasets for language model pre-training

created 2 years ago
1,280 stars

Top 31.7% on sourcepulse

Project Summary

This repository provides the Dolma Toolkit, a high-performance, portable, and extensible toolkit for curating large datasets for language model pre-training. It enables efficient processing of billions of documents with built-in taggers and fast deduplication, targeting researchers and engineers building large language models.

How It Works

The toolkit processes documents concurrently using built-in parallelism, so the same pipeline can run on a single machine, a cluster, or a cloud environment. It ships ready-to-use taggers that implement common dataset curation heuristics (e.g., Gopher, C4, OpenWebText) and uses a Bloom filter implemented in Rust for fast document deduplication. It can be extended with custom taggers and works with AWS S3-compatible storage.
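To make the deduplication idea concrete, here is a minimal Python sketch of Bloom-filter-based exact deduplication. It is not Dolma's Rust implementation, and it does not use Dolma's API: the names `BloomFilter`, `add_if_new`, and `deduplicate`, the `{"id", "text"}` document shape, and the sizing defaults are all assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(
            8, int(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Insert item; return True if it was definitely not seen before."""
        is_new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                is_new = True
                self.bits[byte] |= 1 << bit
        return is_new


def deduplicate(documents, expected_items: int = 1000):
    """Keep the first occurrence of each distinct text (hypothetical doc shape)."""
    seen = BloomFilter(expected_items)
    return [doc for doc in documents if seen.add_if_new(doc["text"])]


docs = [
    {"id": "a", "text": "hello world"},
    {"id": "b", "text": "hello world"},
    {"id": "c", "text": "something different"},
]
kept = deduplicate(docs)
print([doc["id"] for doc in kept])
```

The appeal of this structure for very large corpora is that memory stays fixed regardless of how many documents pass through, at the cost of occasionally discarding a unique document as a false positive.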

Quick Start & Requirements

Highlighted Details

  • Supports curation of datasets up to 3 trillion tokens.
  • Includes built-in taggers for common dataset curation tasks (e.g., Gopher, C4, OpenWebText).
  • Features fast document deduplication using a Rust Bloom filter.
  • Extensible for custom taggers and cloud storage.
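As a rough illustration of what a tagger does conceptually, the sketch below marks character spans of a document with a label and a score, which a later filtering step can act on. It deliberately does not use Dolma's actual tagger classes or registry: `Span`, `tag_short_lines`, `drop_spans`, and the "fewer than three words" rule are hypothetical and chosen only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # name of the attribute being tagged
    score: float


def tag_short_lines(text: str, min_words: int = 3) -> List[Span]:
    """Tag every line that has fewer than `min_words` whitespace-separated words."""
    spans = []
    offset = 0
    for line in text.splitlines(keepends=True):
        if len(line.split()) < min_words:
            spans.append(Span(offset, offset + len(line), "short_line", 1.0))
        offset += len(line)
    return spans


def drop_spans(text: str, spans: List[Span]) -> str:
    """Return `text` with the tagged spans removed (a simple filtering step)."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        pieces.append(text[cursor:span.start])
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Menu\nThis line has enough words to keep.\nLog in\n"
tags = tag_short_lines(sample)
print(drop_spans(sample, tags))
```

Separating tagging from filtering in this way means the same tags can be reused with different thresholds without reprocessing the raw text.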

Maintenance & Community

  • Developed by the Allen Institute for AI (AI2).
  • Citation details provided for the Dolma dataset and toolkit.

Licensing & Compatibility

  • Dolma Dataset is licensed under ODC-BY.
  • The README does not explicitly state a license for the toolkit repository itself.

Limitations & Caveats

The README specifies a license only for the associated dataset, not for the Dolma Toolkit. Anyone considering commercial use should confirm the toolkit's license in the repository before adopting it.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star History: 82 stars in the last 90 days
