datatrove by huggingface

Data processing library for large-scale text data

created 2 years ago
2,516 stars

Top 19.0% on sourcepulse

Project Summary

DataTrove is a Python library designed for large-scale text data processing, filtering, and deduplication, primarily targeting LLM data preparation. It offers a platform-agnostic framework with pre-built processing blocks and the ability to integrate custom functionality, enabling efficient data manipulation locally or on distributed systems like Slurm.

How It Works

DataTrove pipelines are composed of sequential blocks, each processing Document objects (containing text, id, and metadata). The library uses an executor model (LocalPipelineExecutor, SlurmPipelineExecutor) to manage distributed execution. Tasks are the unit of parallelization, processing shards of input files. DataTrove tracks completed tasks, allowing for resilient restarts. It leverages fsspec for broad filesystem compatibility (local, S3, etc.) and supports various data formats and extraction methods like Trafilatura for HTML.
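The block/executor model described above can be sketched in plain Python. This is a conceptual illustration only, not DataTrove's actual classes or signatures: blocks here are generator functions that consume and yield `Document` objects, and `run_pipeline` plays the role an executor would for a single task.

```python
from dataclasses import dataclass, field

# Conceptual sketch of the pipeline model -- illustrative stdlib code,
# not DataTrove's real API. Blocks take a stream of Documents and yield
# Documents; an executor chains them and runs one shard per task.

@dataclass
class Document:
    text: str
    id: str
    metadata: dict = field(default_factory=dict)

def length_filter(docs, min_chars=20):
    """A filter-style block: drop documents shorter than min_chars."""
    for doc in docs:
        if len(doc.text) >= min_chars:
            yield doc

def lowercase_block(docs):
    """A map-style block: normalize text before passing it on."""
    for doc in docs:
        doc.text = doc.text.lower()
        yield doc

def run_pipeline(blocks, docs):
    """Thread documents through each block in sequence, as an executor
    would do for the shard of files assigned to one task."""
    stream = iter(docs)
    for block in blocks:
        stream = block(stream)
    return list(stream)

docs = [
    Document(text="A Short Doc", id="0"),
    Document(text="A document long enough to survive the filter.", id="1"),
]
out = run_pipeline([length_filter, lowercase_block], docs)
# Only document "1" survives the filter, and its text is lowercased.
```

Because each block only sees a lazy stream of documents, shards can be processed with constant memory, which is what makes the same pipeline definition usable both locally and on a Slurm cluster.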

Quick Start & Requirements

Install with pip install datatrove[FLAVOUR], where FLAVOUR can be all, io, processing, s3, or cli. Examples are available for reproducing datasets like FineWeb and processing Common Crawl.

Highlighted Details

  • Platform-agnostic pipelines runnable on local machines or Slurm clusters.
  • Resilient execution with task completion tracking for restarts.
  • Extensive support for various data formats and filesystems via fsspec.
  • Comprehensive suite of pre-built blocks for reading, writing, filtering, extraction, deduplication, and statistics.

Maintenance & Community

Developed by Hugging Face, with contributions from Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, and Thomas Wolf.

Licensing & Compatibility

Apache 2.0 License. Permissive for commercial use and integration with closed-source projects.

Limitations & Caveats

Files are processed by a single task; DataTrove does not automatically split large files. The number of tasks should ideally not exceed the number of input files for optimal parallelization.
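The caveat above can be made concrete with a small sketch of round-robin file sharding (illustrative only, not DataTrove's internal scheduler): whole files are dealt out to tasks, so any task beyond the number of input files receives nothing and sits idle.

```python
# Sketch of the file-to-task caveat: files are assigned whole to tasks,
# never split, so extra tasks beyond the file count get empty shards.
# Hypothetical helper for illustration, not a DataTrove function.

def shard_files(files, num_tasks):
    """Assign whole files to tasks round-robin."""
    return [files[rank::num_tasks] for rank in range(num_tasks)]

files = [f"shard_{i}.jsonl" for i in range(4)]

# 4 files, 2 tasks: each task processes two whole files.
two_tasks = shard_files(files, 2)

# 4 files, 6 tasks: tasks 4 and 5 receive no files and sit idle.
six_tasks = shard_files(files, 6)
idle = sum(1 for shard in six_tasks if not shard)
print(idle)  # 2
```

In practice this means one very large input file bounds the runtime of its task; pre-splitting inputs into more, smaller files is the way to raise effective parallelism.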

Health Check

Last commit: 1 day ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 5
Star History: 160 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Alexander Wettig (author of SWE-bench, SWE-agent), and 2 more.

data-juicer by modelscope

0.7% · 5k stars
Data-Juicer: Data processing system for foundation models
created 2 years ago · updated 1 day ago
Starred by Stas Bekman (author of Machine Learning Engineering Open Book; Research Engineer at Snowflake) and Travis Fischer (founder of Agentic).

lingua by facebookresearch

0.1% · 5k stars
LLM research codebase for training and inference
created 9 months ago · updated 2 weeks ago