Data processing library for large-scale text data
DataTrove is a Python library designed for large-scale text data processing, filtering, and deduplication, primarily targeting LLM data preparation. It offers a platform-agnostic framework with pre-built processing blocks and the ability to integrate custom functionality, enabling efficient data manipulation locally or on distributed systems like Slurm.
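The custom-functionality hook is subclassing a pipeline block. A minimal sketch, assuming DataTrove's documented `PipelineStep` base class and `run` signature; the filtering step itself is hypothetical:

```python
from datatrove.data import DocumentsPipeline
from datatrove.pipeline.base import PipelineStep


class ShortDocFilter(PipelineStep):
    """Hypothetical custom block: drop documents below a minimum length."""

    def __init__(self, min_chars: int = 200):
        super().__init__()
        self.min_chars = min_chars

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:
        # Blocks consume and yield a stream of Document objects, so a
        # custom step composes freely with the pre-built ones.
        for doc in data:
            if len(doc.text) >= self.min_chars:
                yield doc
```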
How It Works
DataTrove pipelines are composed of sequential blocks, each processing `Document` objects (containing `text`, `id`, and `metadata`). The library uses an executor model (`LocalPipelineExecutor`, `SlurmPipelineExecutor`) to manage distributed execution. Tasks are the unit of parallelization; each task processes a shard of the input files. DataTrove tracks completed tasks, allowing for resilient restarts. It leverages `fsspec` for broad filesystem compatibility (local, S3, etc.) and supports various data formats and extraction methods, such as Trafilatura for HTML.
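A minimal local run might look like the following sketch; the reader, filter, and writer classes are DataTrove's pre-built blocks, while the paths and task count are placeholder assumptions:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/input/"),                     # hypothetical input folder
        LambdaFilter(lambda doc: len(doc.text) > 100),  # keep non-trivial documents
        JsonlWriter("data/output/"),                    # hypothetical output folder
    ],
    tasks=4,              # each task processes its own shard of the input files
    logging_dir="logs/",  # completion markers stored here enable resilient restarts
)
executor.run()
```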
Quick Start & Requirements
Install with `pip install datatrove[FLAVOUR]`, where `FLAVOUR` can be `all`, `io`, `processing`, `s3`, or `cli`. Examples are available for reproducing datasets like FineWeb and processing Common Crawl.
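To scale the same pipeline out on a cluster, the local executor can be swapped for `SlurmPipelineExecutor`; in this sketch the job settings (partition, walltime, paths) are placeholder assumptions:

```python
from datatrove.executor import SlurmPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

executor = SlurmPipelineExecutor(
    pipeline=[
        JsonlReader("s3://my-bucket/input/"),   # fsspec resolves remote paths
        JsonlWriter("s3://my-bucket/output/"),
    ],
    job_name="datatrove-example",  # placeholder job name
    tasks=500,
    time="10:00:00",               # placeholder walltime
    partition="cpu",               # placeholder partition
    logging_dir="logs/",
)
executor.run()
```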
Highlighted Details
Filesystem-agnostic I/O via `fsspec`, covering local disks, S3, and other remote filesystems.
Maintenance & Community
Developed by Hugging Face, with contributions from Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, and Thomas Wolf.
Licensing & Compatibility
Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.
Limitations & Caveats
Each file is processed by a single task; DataTrove does not automatically split large files. The number of tasks should therefore not exceed the number of input files, since any surplus tasks will have no files to process.
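One way to respect this caveat is to derive the task count from the input listing, as in this sketch (paths are hypothetical):

```python
import fsspec

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

# Count the input shards up front: a file is never split across tasks,
# so any tasks beyond the file count would sit idle.
fs = fsspec.filesystem("file")
n_files = len(fs.glob("data/input/*.jsonl"))

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/input/"),
        JsonlWriter("data/output/"),
    ],
    tasks=max(1, n_files),
)
executor.run()
```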