deduplicate-text-datasets by google-research

Rust CLI tool for language model dataset deduplication

created 4 years ago
1,230 stars

Top 32.6% on sourcepulse

Project Summary

This repository provides tools for deduplicating large text datasets, primarily for training language models. It implements exact substring deduplication with a Rust-based suffix array, aiming to improve training efficiency, reduce model memorization, and potentially improve held-out perplexity. The target audience is ML researchers and engineers working with large-scale text corpora.

How It Works

The core of the deduplication process is a suffix array built over the entire dataset, which makes repeated substrings efficient to find. The implementation uses 64-bit indices (so datasets larger than 4 GB can be indexed) and operates on raw byte arrays rather than UTF-8 strings, which lets it deduplicate tokenized sequences directly. The pipeline identifies duplicate substrings, collects them into byte ranges, and removes those ranges from the dataset. Suffix array construction runs in linear time and substring queries in logarithmic time, which is what makes processing massive datasets tractable.
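To make that pipeline concrete, here is a minimal Python sketch of the same idea (an illustration under simplified assumptions, not the repository's Rust implementation): sort all suffixes, flag long common prefixes between adjacent suffixes as duplicate byte ranges, and cut those ranges out of the data.

```python
def find_duplicate_ranges(data: bytes, min_length: int = 8):
    """Return merged byte ranges [(start, end), ...] covering substrings
    of at least min_length bytes that also occur elsewhere in data."""
    # Naive suffix array: sort suffix start positions lexicographically.
    # (Illustration only; the repository builds this in linear time in Rust.)
    sa = sorted(range(len(data)), key=lambda i: data[i:])

    ranges = []
    # Repeated substrings show up as long common prefixes between
    # suffixes that are adjacent in sorted order.
    for a, b in zip(sa, sa[1:]):
        lcp = 0
        while (a + lcp < len(data) and b + lcp < len(data)
               and data[a + lcp] == data[b + lcp]):
            lcp += 1
        if lcp >= min_length:
            # Flag only the later occurrence so one copy survives
            # (a simplification; the repository discusses its own policy).
            dup = max(a, b)
            ranges.append((dup, dup + lcp))

    # Merge overlapping ranges before cutting them out.
    ranges.sort()
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def remove_ranges(data: bytes, ranges) -> bytes:
    """Drop the flagged byte ranges, keeping everything in between."""
    out, prev = [], 0
    for start, end in ranges:
        out.append(data[prev:start])
        prev = end
    out.append(data[prev:])
    return b"".join(out)
```

For example, remove_ranges(b"abcabc", find_duplicate_ranges(b"abcabc", min_length=3)) returns b"abc". The naive sort here is roughly O(n² log n); the Rust implementation is what makes this viable at hundreds of gigabytes.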

Quick Start & Requirements

  • Installation: Requires Rust (rustup.rs) and a C compiler (e.g., sudo apt-get install gcc). Python dependencies include numpy, scipy, and sentencepiece; loading TensorFlow Datasets additionally requires the packages pinned in requirements-tf.txt.
  • Basic Usage: Compile the Rust code with cargo build. Load datasets with python3 scripts/load_dataset.py, then build suffix arrays with python3 scripts/make_suffix_array.py (the full sequence is sketched after this list).
  • Resources: Small datasets (<10GB) require ~16GB RAM and a few CPU cores. Large datasets (e.g., C4 ~300GB) require many cores (e.g., 96) and >600GB RAM, plus >1TB disk space.
  • Links: Official Quick Start
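A typical end-to-end sequence, assembled from the commands above (script arguments are elided; dataset names and flags depend on your setup, so consult the repository README):

```
# 1. Compile the Rust tooling (requires rustup and a C compiler)
cargo build

# 2. Download and serialize a dataset into the expected byte format
python3 scripts/load_dataset.py ...

# 3. Build the suffix array over the serialized dataset file
python3 scripts/make_suffix_array.py ...
```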

Highlighted Details

  • Implements exact substring deduplication using suffix arrays.
  • Rust implementation is optimized for performance and memory efficiency.
  • Provides scripts for processing TensorFlow Datasets (TFDS) and single files.
  • Offers tools for counting occurrences and for finding duplicates within and across documents (see the sketch after this list).
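To illustrate the counting step (a Python sketch, not the repository's Rust tool): every suffix that begins with the query occupies one contiguous block of the sorted suffix array, so two binary searches bound that block and its width is the occurrence count.

```python
import bisect

def count_occurrences(data: bytes, sa: list[int], query: bytes) -> int:
    """Count occurrences of query via two binary searches over a prebuilt
    suffix array sa: O(|query| * log n). Requires Python >= 3.10 for the
    key= argument to bisect."""
    def prefix(i: int) -> bytes:
        # First len(query) bytes of the suffix starting at i.
        return data[i:i + len(query)]
    lo = bisect.bisect_left(sa, query, key=prefix)
    hi = bisect.bisect_right(sa, query, key=prefix)
    return hi - lo

data = b"the cat sat on the mat"
sa = sorted(range(len(data)), key=lambda i: data[i:])  # naive construction
print(count_occurrences(data, sa, b"the"))  # -> 2
```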

Maintenance & Community

This is not an officially supported Google product. The repository is associated with research by Katherine Lee, Daphne Ippolito, and others. Version 1.0.0 represents a significant restructuring and is not backward compatible with 0.1.0.

Licensing & Compatibility

The README does not explicitly state a license. Check the repository itself for a LICENSE file before commercial use or integration into closed-source projects.

Limitations & Caveats

The code is described as "research code" and may not directly suit all use cases. The make command for suffix array construction is single-threaded and memory-prohibitive for large files. The across-similar command requires the entire dataset to fit in memory. The finish_dedup_wiki40b.py script is noted as potentially slow and could benefit from parallelization.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 26 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jeff Hammerbacher (cofounder of Cloudera).

dolma by allenai (Top 0.4%, 1k stars)
Toolkit for curating datasets for language model pre-training
created 2 years ago, updated 1 day ago