LLM contamination detector for quantifying rephrased samples
This repository provides tools to detect and mitigate data contamination in large language model (LLM) training datasets by identifying rephrased samples that overlap with benchmark datasets. It is aimed at researchers and practitioners who need to ensure the integrity of their training data and the reliability of their evaluation results.
How It Works
The decontaminator works in two stages: it first uses a similarity measure (such as embedding similarity) to retrieve, for each benchmark sample, the most similar candidates from the training dataset, and then uses a language model to judge whether each candidate pair is a rephrasing of the same content. This quantifies the degree of rephrased contamination, allowing users to estimate the percentage of overlapping data and filter the problematic samples out of the training set.
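The sketch below illustrates this two-stage idea. It is not the repository's actual implementation: the embedding model, the judge prompt, and all function names are assumptions chosen for illustration.

```python
# Minimal sketch of the two-stage detection idea, assuming a
# sentence-transformers embedding model and an LLM judge callable.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def top_k_candidates(train_texts, test_texts, k=1):
    """Stage 1: for each benchmark sample, retrieve the k most similar
    training samples by embedding cosine similarity."""
    train_emb = model.encode(train_texts, convert_to_tensor=True)
    test_emb = model.encode(test_texts, convert_to_tensor=True)
    hits = util.semantic_search(test_emb, train_emb, top_k=k)
    return [
        [(train_texts[h["corpus_id"]], h["score"]) for h in per_test]
        for per_test in hits
    ]

# Hypothetical judge prompt; the repository's actual prompt may differ.
JUDGE_PROMPT = (
    "Do these two samples convey the same content, i.e. is one a "
    "rephrasing of the other? Answer 'yes' or 'no'.\n\nA: {a}\n\nB: {b}"
)

def is_rephrase(llm_complete, test_text, train_text):
    """Stage 2: ask an LLM judge (e.g. via the OpenAI API) whether a
    retrieved candidate pair is a rephrased duplicate."""
    answer = llm_complete(JUDGE_PROMPT.format(a=test_text, b=train_text))
    return answer.strip().lower().startswith("yes")
```

Training samples flagged as rephrases of benchmark items can then be removed before fine-tuning, and the fraction of flagged samples serves as the contamination estimate.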
Quick Start & Requirements
- Setup: create a conda environment (`conda create -n llm-detect python=3.9`), activate it (`conda activate llm-detect`), and install dependencies (`pip install -r requirement.txt`).
- Requirements: the `datasets` library and an OpenAI API key (for end-to-end execution).
- Data format: JSONL files with one `{"text": data}` object per line, or datasets loaded via the `datasets` library.
- Run: `python3 main.py --train_path <train_data.jsonl> --test_path <test_data.jsonl> --output_path <output_db.jsonl> --data-type code --top_k 1` (requires the `OPENAI_API_KEY` environment variable).
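For example, a training file in the expected JSONL format can be produced with a few lines of Python. The file name and sample texts here are placeholders:

```python
import json

# Write one {"text": ...} object per line, the input format noted above.
samples = ["def add(a, b): return a + b", "def mul(a, b): return a * b"]
with open("train_data.jsonl", "w") as f:
    for text in samples:
        f.write(json.dumps({"text": text}) + "\n")
```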
Maintenance & Community
The project is associated with the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" by Shuo Yang et al. Further community engagement details are not explicitly provided in the README.
Licensing & Compatibility
The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
The end-to-end execution relies on an OpenAI API key, which may incur costs and introduces an external dependency. The effectiveness of contamination detection also depends on the chosen similarity metric and parameters (e.g., `top_k`).