datatrove by huggingface

Data processing library for large-scale text data

Created 2 years ago
2,631 stars

Top 17.9% on SourcePulse

View on GitHub
Project Summary

DataTrove is a Python library designed for large-scale text data processing, filtering, and deduplication, primarily targeting LLM data preparation. It offers a platform-agnostic framework with pre-built processing blocks and the ability to integrate custom functionality, enabling efficient data manipulation locally or on distributed systems like Slurm.

How It Works

DataTrove pipelines are composed of sequential blocks, each processing Document objects (containing text, id, and metadata). The library uses an executor model (LocalPipelineExecutor, SlurmPipelineExecutor) to manage distributed execution. Tasks are the unit of parallelization, processing shards of input files. DataTrove tracks completed tasks, allowing for resilient restarts. It leverages fsspec for broad filesystem compatibility (local, S3, etc.) and supports various data formats and extraction methods like Trafilatura for HTML.
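
The following is a minimal sketch of that model, adapted from the pattern shown in the project README: a reader, a filter, and a writer handed to a LocalPipelineExecutor. The paths, the task count, and the length-based filter are illustrative placeholders, not recommendations.

    from datatrove.executor import LocalPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.filters import LambdaFilter
    from datatrove.pipeline.writers import JsonlWriter

    # Each block receives and emits Document objects (text, id, metadata).
    executor = LocalPipelineExecutor(
        pipeline=[
            JsonlReader("s3://my-bucket/raw/"),           # placeholder input location
            LambdaFilter(lambda doc: len(doc.text) > 0),  # drop empty documents
            JsonlWriter("s3://my-bucket/filtered/"),      # placeholder output location
        ],
        tasks=10,             # tasks are the unit of parallelization
        logging_dir="logs/",  # completion tracking is written here, enabling restarts
    )
    executor.run()

Because completed tasks are recorded under logging_dir, re-running the same executor skips work that already finished.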

Quick Start & Requirements

Install with pip install datatrove[FLAVOUR], where FLAVOUR can be all, io, processing, s3, or cli. Examples are available for reproducing datasets like FineWeb and processing Common Crawl.

Highlighted Details

  • Platform-agnostic pipelines runnable on local machines or Slurm clusters (a Slurm sketch follows this list).
  • Resilient execution with task completion tracking for restarts.
  • Extensive support for various data formats and filesystems via fsspec.
  • Comprehensive suite of pre-built blocks for reading, writing, filtering, extraction, deduplication, and statistics.
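
As a sketch of the Slurm path mentioned in the first bullet, the same pipeline can be handed to a SlurmPipelineExecutor instead; the job name, partition, walltime, and task count below are placeholders for a real cluster configuration.

    from datatrove.executor import SlurmPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.writers import JsonlWriter

    # Same pipeline abstraction as the local case; each Slurm array job runs one task.
    executor = SlurmPipelineExecutor(
        job_name="filter_dump",                  # placeholder job name
        pipeline=[
            JsonlReader("s3://my-bucket/raw/"),
            JsonlWriter("s3://my-bucket/out/"),
        ],
        tasks=500,                               # placeholder number of shards/array jobs
        time="10:00:00",                         # placeholder walltime
        partition="cpu",                         # placeholder partition name
        logging_dir="s3://my-bucket/logs/",      # completion tracking, as in the local case
    )
    executor.run()  # submits the job array when invoked on the cluster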

Maintenance & Community

Developed by Hugging Face, with contributions from Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, and Thomas Wolf.

Licensing & Compatibility

Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Each input file is processed by a single task; DataTrove does not automatically split large files, so a single very large file cannot be parallelized. For best parallelization, the number of tasks should not exceed the number of input files.
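
A hypothetical illustration of that caveat: since each input file is handled by exactly one task, capping the task count at the number of input shards avoids idle tasks. The glob pattern and the cap of 64 are assumptions made up for this example.

    import glob

    from datatrove.executor import LocalPipelineExecutor
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.writers import JsonlWriter

    input_files = glob.glob("data/*.jsonl.gz")  # placeholder local shards

    # Tasks beyond len(input_files) would have nothing to process.
    executor = LocalPipelineExecutor(
        pipeline=[JsonlReader("data/"), JsonlWriter("output/")],
        tasks=min(64, len(input_files)),
    )
    executor.run()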

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 8
  • Issues (30d): 2
  • Star History: 83 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Elvis Saravia (Founder of DAIR.AI), and 4 more.

dolma by allenai

0.2%
1k
Toolkit for curating datasets for language model pre-training
Created 2 years ago
Updated 2 days ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Alexander Wettig (Coauthor of SWE-bench, SWE-agent), and 5 more.

data-juicer by modelscope

0.7%
5k
Data-Juicer: Data processing system for foundation models
Created 2 years ago
Updated 23 hours ago