deduplicate-text-datasets by google-research

Rust CLI tool for language model dataset deduplication

created 4 years ago
1,230 stars

Top 32.6% on sourcepulse

Project Summary

This repository provides tools for deduplicating large text datasets, primarily for training language models. It implements exact substring deduplication with a Rust-based suffix array, aiming to improve training efficiency, reduce model memorization, and potentially improve held-out perplexity. The target audience is ML researchers and engineers working with large-scale text corpora.

How It Works

The core of the deduplication process is a suffix array built over the entire dataset, which makes repeated substrings efficient to find. The implementation uses 64-bit indices (so datasets larger than 4 GB can be indexed) and operates on raw byte arrays rather than UTF-8 strings, which lets it deduplicate tokenized sequences directly. The pipeline identifies duplicate substrings, collects them into byte ranges, and removes those ranges from the dataset. Suffix array construction runs in linear time and substring queries in logarithmic time, which is what makes processing massive datasets tractable.
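To make that pipeline concrete, here is a minimal Python sketch of the same idea (an illustration under simplified assumptions, not the repository's Rust implementation): sort all suffixes, flag long common prefixes between adjacent suffixes as duplicate byte ranges, and cut those ranges out of the data.

```python
def find_duplicate_ranges(data: bytes, min_length: int = 8):
    """Return merged byte ranges [(start, end), ...] covering substrings
    of at least min_length bytes that also occur elsewhere in data."""
    # Naive suffix array: sort suffix start positions lexicographically.
    # (Illustration only; the repository builds this in linear time in Rust.)
    sa = sorted(range(len(data)), key=lambda i: data[i:])

    ranges = []
    # Repeated substrings show up as long common prefixes between
    # suffixes that are adjacent in sorted order.
    for a, b in zip(sa, sa[1:]):
        lcp = 0
        while (a + lcp < len(data) and b + lcp < len(data)
               and data[a + lcp] == data[b + lcp]):
            lcp += 1
        if lcp >= min_length:
            # Flag only the later occurrence so one copy survives
            # (a simplification; the repository discusses its own policy).
            dup = max(a, b)
            ranges.append((dup, dup + lcp))

    # Merge overlapping ranges before cutting them out.
    ranges.sort()
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def remove_ranges(data: bytes, ranges) -> bytes:
    """Drop the flagged byte ranges, keeping everything in between."""
    out, prev = [], 0
    for start, end in ranges:
        out.append(data[prev:start])
        prev = end
    out.append(data[prev:])
    return b"".join(out)
```

For example, remove_ranges(b"abcabc", find_duplicate_ranges(b"abcabc", min_length=3)) returns b"abc". The naive sort here is roughly O(n² log n); the Rust implementation is what makes this viable at hundreds of gigabytes.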

Quick Start & Requirements

  • Installation: Requires Rust (rustup.rs) and a C compiler (e.g., sudo apt-get install gcc). Python dependencies include numpy, scipy, and sentencepiece; loading TensorFlow Datasets additionally requires the packages pinned in requirements-tf.txt.
  • Basic Usage: Compile the Rust code with cargo build. Load datasets with python3 scripts/load_dataset.py, then build suffix arrays with python3 scripts/make_suffix_array.py (the full sequence is sketched after this list).
  • Resources: Small datasets (<10GB) require ~16GB RAM and a few CPU cores. Large datasets (e.g., C4 ~300GB) require many cores (e.g., 96) and >600GB RAM, plus >1TB disk space.
  • Links: Official Quick Start
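A typical end-to-end sequence, assembled from the commands above (script arguments are elided; dataset names and flags depend on your setup, so consult the repository README):

```
# 1. Compile the Rust tooling (requires rustup and a C compiler)
cargo build

# 2. Download and serialize a dataset into the expected byte format
python3 scripts/load_dataset.py ...

# 3. Build the suffix array over the serialized dataset file
python3 scripts/make_suffix_array.py ...
```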

Highlighted Details

  • Implements exact substring deduplication using suffix arrays.
  • Rust implementation is optimized for performance and memory efficiency.
  • Provides scripts for processing TensorFlow Datasets (TFDS) and single files.
  • Offers tools for counting occurrences and for finding duplicates within and across documents (see the sketch after this list).
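To illustrate the counting step (a Python sketch, not the repository's Rust tool): every suffix that begins with the query occupies one contiguous block of the sorted suffix array, so two binary searches bound that block and its width is the occurrence count.

```python
import bisect

def count_occurrences(data: bytes, sa: list[int], query: bytes) -> int:
    """Count occurrences of query via two binary searches over a prebuilt
    suffix array sa: O(|query| * log n). Requires Python >= 3.10 for the
    key= argument to bisect."""
    def prefix(i: int) -> bytes:
        # First len(query) bytes of the suffix starting at i.
        return data[i:i + len(query)]
    lo = bisect.bisect_left(sa, query, key=prefix)
    hi = bisect.bisect_right(sa, query, key=prefix)
    return hi - lo

data = b"the cat sat on the mat"
sa = sorted(range(len(data)), key=lambda i: data[i:])  # naive construction
print(count_occurrences(data, sa, b"the"))  # -> 2
```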

Maintenance & Community

This is not an officially supported Google product. The repository is associated with research by Katherine Lee, Daphne Ippolito, and others. Version 1.0.0 represents a significant restructuring and is not backward compatible with 0.1.0.

Licensing & Compatibility

The README does not explicitly state a license. Check the repository itself for a LICENSE file before commercial use or integration into closed-source projects.

Limitations & Caveats

The code is described as "research code" and may not directly suit all use cases. The make command for suffix array construction is single-threaded and memory-prohibitive for large files. The across-similar command requires the entire dataset to fit in memory. The finish_dedup_wiki40b.py script is noted as potentially slow and could benefit from parallelization.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 26 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jeff Hammerbacher (cofounder of Cloudera).

dolma by allenai (Top 0.4%, 1k stars)
Toolkit for curating datasets for language model pre-training
created 2 years ago, updated 1 day ago