datatrove by huggingface

Data processing library for large-scale text data

Created 2 years ago
2,631 stars

Top 17.9% on SourcePulse

View on GitHub
Project Summary

DataTrove is a Python library designed for large-scale text data processing, filtering, and deduplication, primarily targeting LLM data preparation. It offers a platform-agnostic framework with pre-built processing blocks and the ability to integrate custom functionality, enabling efficient data manipulation locally or on distributed systems like Slurm.

How It Works

DataTrove pipelines are composed of sequential blocks, each processing Document objects (containing text, id, and metadata). The library uses an executor model (LocalPipelineExecutor, SlurmPipelineExecutor) to manage distributed execution. Tasks are the unit of parallelization, processing shards of input files. DataTrove tracks completed tasks, allowing for resilient restarts. It leverages fsspec for broad filesystem compatibility (local, S3, etc.) and supports various data formats and extraction methods like Trafilatura for HTML.
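
The following is a minimal sketch of that model, adapted from the pattern shown in the project README: a reader, a filter, and a writer handed to a LocalPipelineExecutor. The paths, the task count, and the length-based filter are illustrative placeholders, not recommendations.

    from datatrove.executor import LocalPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.filters import LambdaFilter
    from datatrove.pipeline.writers import JsonlWriter

    # Each block receives and emits Document objects (text, id, metadata).
    executor = LocalPipelineExecutor(
        pipeline=[
            JsonlReader("s3://my-bucket/raw/"),           # placeholder input location
            LambdaFilter(lambda doc: len(doc.text) > 0),  # drop empty documents
            JsonlWriter("s3://my-bucket/filtered/"),      # placeholder output location
        ],
        tasks=10,             # tasks are the unit of parallelization
        logging_dir="logs/",  # completion tracking is written here, enabling restarts
    )
    executor.run()

Because completed tasks are recorded under logging_dir, re-running the same executor skips work that already finished.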

Quick Start & Requirements

Install with pip install datatrove[FLAVOUR], where FLAVOUR can be all, io, processing, s3, or cli. Examples are available for reproducing datasets like FineWeb and processing Common Crawl.

Highlighted Details

  • Platform-agnostic pipelines runnable on local machines or Slurm clusters (a Slurm sketch follows this list).
  • Resilient execution with task completion tracking for restarts.
  • Extensive support for various data formats and filesystems via fsspec.
  • Comprehensive suite of pre-built blocks for reading, writing, filtering, extraction, deduplication, and statistics.
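
As a sketch of the Slurm path mentioned in the first bullet, the same pipeline can be handed to a SlurmPipelineExecutor instead; the job name, partition, walltime, and task count below are placeholders for a real cluster configuration.

    from datatrove.executor import SlurmPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.writers import JsonlWriter

    # Same pipeline abstraction as the local case; each Slurm array job runs one task.
    executor = SlurmPipelineExecutor(
        job_name="filter_dump",                  # placeholder job name
        pipeline=[
            JsonlReader("s3://my-bucket/raw/"),
            JsonlWriter("s3://my-bucket/out/"),
        ],
        tasks=500,                               # placeholder number of shards/array jobs
        time="10:00:00",                         # placeholder walltime
        partition="cpu",                         # placeholder partition name
        logging_dir="s3://my-bucket/logs/",      # completion tracking, as in the local case
    )
    executor.run()  # submits the job array when invoked on the cluster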

Maintenance & Community

Developed by Hugging Face, with contributions from Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, and Thomas Wolf.

Licensing & Compatibility

Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Each input file is processed by a single task; DataTrove does not automatically split large files, so a single very large file cannot be parallelized. For best parallelization, the number of tasks should not exceed the number of input files.
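
A hypothetical illustration of that caveat: since each input file is handled by exactly one task, capping the task count at the number of input shards avoids idle tasks. The glob pattern and the cap of 64 are assumptions made up for this example.

    import glob

    from datatrove.executor import LocalPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.writers import JsonlWriter

    input_files = glob.glob("data/*.jsonl.gz")  # placeholder local shards

    # Tasks beyond len(input_files) would have nothing to process.
    executor = LocalPipelineExecutor(
        pipeline=[JsonlReader("data/"), JsonlWriter("output/")],
        tasks=min(64, len(input_files)),
    )
    executor.run()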

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 8
  • Issues (30d): 2
  • Star History: 83 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia (Founder of DAIR.AI), and 4 more.

dolma by allenai

0.2%
1k
Toolkit for curating datasets for language model pre-training
Created 2 years ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Alexander Wettig (Coauthor of SWE-bench, SWE-agent), and 5 more.

data-juicer by modelscope

0.7%
5k
Data-Juicer: Data processing system for foundation models
Created 2 years ago
Updated 23 hours ago