Rust CLI tool for language model dataset deduplication
This repository provides tools for deduplicating large text datasets, primarily for training language models. It implements exact substring deduplication using a Rust-based suffix array approach, with the goals of improving training efficiency, reducing model memorization, and potentially improving perplexity. The target audience includes ML researchers and engineers working with large-scale text corpora.
How It Works
The core of the deduplication process relies on building a suffix array over the entire dataset, which allows repeated substrings to be identified efficiently. The implementation uses 64-bit suffix indices, so it can index datasets larger than 4 GB, and it operates on byte arrays rather than only UTF-8 strings, so tokenized sequences can be deduplicated as well. The process identifies duplicate substrings, collects them into byte ranges, and then removes those ranges from the dataset. Suffix array construction runs in linear time and queries run in logarithmic time, enabling efficient processing of massive datasets.
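To make the idea concrete, the sketch below shows the technique in miniature: in a sorted suffix array, adjacent entries share long common prefixes exactly where the data repeats. This is a naive O(n² log n) Python illustration, not the repo's linear-time Rust implementation, and the min_len threshold here is an arbitrary choice for the example.

```python
def build_suffix_array(data: bytes) -> list[int]:
    # Naive construction: sort all suffix start positions by suffix bytes.
    # The repo's Rust code builds this in linear time; this is O(n^2 log n).
    return sorted(range(len(data)), key=lambda i: data[i:])

def duplicate_ranges(data: bytes, min_len: int = 8) -> list[tuple[int, int]]:
    # Adjacent suffixes in sorted order share their longest common prefixes.
    # Any shared prefix of at least min_len bytes marks a repeated substring.
    sa = build_suffix_array(data)
    ranges = []
    for a, b in zip(sa, sa[1:]):
        lcp = 0
        while (a + lcp < len(data) and b + lcp < len(data)
               and data[a + lcp] == data[b + lcp]):
            lcp += 1
        if lcp >= min_len:
            ranges.append((b, b + lcp))  # flag one copy; keep the other
    return ranges

data = b"the cat sat on the mat. the cat sat on the rug."
print(duplicate_ranges(data, min_len=12))
```

In practice the threshold is set to the minimum duplicate length worth removing; a larger value trades recall for precision.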
Quick Start & Requirements
Install Rust (via rustup.rs) and a C compiler (e.g., sudo apt-get install gcc). Python dependencies include numpy, scipy, sentencepiece, and the packages in requirements-tf.txt. Build the Rust tool with cargo build. Load datasets with python3 scripts/load_dataset.py, then build suffix arrays with python3 scripts/make_suffix_array.py.
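Because the tool operates on raw bytes, a dataset is first packed into a single byte file before the suffix array is built. The sketch below illustrates that packing step; the separator, offset file, and filenames are assumptions for illustration, not the exact on-disk format produced by scripts/load_dataset.py.

```python
import numpy as np

# Hypothetical packing: concatenate documents into one byte buffer and
# record each document's start offset so matched byte ranges can be
# mapped back to documents later. The actual layout written by
# scripts/load_dataset.py may differ.
docs = ["first document text", "second document text"]
SEP = b"\xff\xff"  # illustrative separator, assumed absent from the data

offsets, buf = [], bytearray()
for doc in docs:
    offsets.append(len(buf))
    buf += doc.encode("utf-8") + SEP

with open("dataset.bin", "wb") as f:
    f.write(buf)
np.save("dataset.offsets.npy", np.array(offsets, dtype=np.int64))
```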
Maintenance & Community
This is not an officially supported Google product. The repository is associated with research by Katherine Lee, Daphne Ippolito, and others. Version 1.0.0 represents a significant restructuring and is not backward compatible with 0.1.0. At the time of writing, the repository has been inactive for about a year.
Licensing & Compatibility
The README does not explicitly state a license, so licensing should be verified before commercial use or integration into closed-source projects.
Limitations & Caveats
The code is described as "research code" and may not directly suit all use cases. The make command for suffix array construction is single-threaded and memory-prohibitive for large files. The across-similar command requires the entire dataset to fit in memory. The finish_dedup_wiki40b.py script is noted as potentially slow and could benefit from parallelization.
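Conceptually, the final removal step cuts the collected byte ranges out of the concatenated dataset. The sketch below shows that operation in isolation; remove_ranges is a hypothetical helper, not a function from the repository's scripts.

```python
def remove_ranges(data: bytes, ranges: list[tuple[int, int]]) -> bytes:
    # Merge overlapping [start, end) ranges, then keep all bytes outside them.
    kept, pos = [], 0
    for start, end in sorted(ranges):
        if start > pos:
            kept.append(data[pos:start])
        pos = max(pos, end)
    kept.append(data[pos:])
    return b"".join(kept)

print(remove_ranges(b"abcdefgh", [(2, 4), (3, 6)]))  # b"abgh"
```

Merging overlapping ranges before cutting matters because the duplicate-collection step can emit nested or overlapping matches.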