llm-decontaminator by lm-sys

LLM contamination detector for quantifying rephrased samples

Created 2 years ago

315 stars

Top 85.8% on SourcePulse

5 Experts Love This Project

Casper Hansen

Author of AutoAWQ

Binyuan Hui

Research Scientist at Alibaba Qwen

Lianmin Zheng

Coauthor of SGLang, vLLM

Ying Sheng

Coauthor of SGLang

and 1 more!

Project Summary

This repository provides tools to detect and mitigate data contamination in large language model (LLM) training datasets by identifying rephrased samples that overlap with benchmark datasets. It is designed for researchers and practitioners working with LLMs who need to ensure the integrity and reliability of their training data and evaluation results.

How It Works

The decontaminator compares each sample in a benchmark (test) set against a training set in two stages. It first retrieves the top_k most similar training samples using embedding similarity, then asks a language model (accessed through the OpenAI API) to judge whether each retrieved pair is a rephrasing of the same content. The flagged pairs give an estimate of the share of the benchmark that is contaminated, and the corresponding samples can then be filtered out of the training set.
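
As a rough illustration of the similarity comparison described above, the sketch below ranks training samples against one test sample and keeps the top_k candidates. It is not the repository's code: the bag-of-words cosine similarity is a toy stand-in for whatever similarity model the tool actually uses, and `bag_of_words`, `cosine`, and `top_k_candidates` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a toy stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_candidates(train_texts, test_text, top_k=1):
    """Return the top_k (score, train_text) pairs most similar to test_text."""
    test_vec = bag_of_words(test_text)
    scored = [(cosine(bag_of_words(t), test_vec), t) for t in train_texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy data, for illustration only.
train = [
    "Write a function that returns the sum of two integers.",
    "Explain how photosynthesis works in plants.",
]
test = "Return the sum of two integers a and b."
candidates = top_k_candidates(train, test, top_k=1)
```

In a real pipeline, each candidate pair would then be handed to a second, more careful check; a lexical overlap score alone would miss samples that are reworded heavily.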

Quick Start & Requirements

Install: Clone the repository, create a conda environment (conda create -n llm-detect python=3.9), activate it (conda activate llm-detect), and install dependencies (pip install -r requirement.txt).
Prerequisites: Python 3.9, datasets library, OpenAI API key (for end-to-end execution).
Data Format: Training and test sets should be in JSONL format, with each line containing {"text": data}.
Example: The README provides a script to load and preprocess data from the Hugging Face datasets library.
End-to-End Execution: python3 main.py --train_path <train_data.jsonl> --test_path <test_data.jsonl> --output_path <output_db.jsonl> --data-type code --top_k 1 (requires OPENAI_API_KEY environment variable).
Resources: Processing 500k samples from StarCoder-Data is demonstrated.
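
The data format above can be produced with a few lines of Python. This is a minimal sketch: `write_jsonl`, `read_jsonl`, and the file name are illustrative assumptions, not part of the repository.

```python
import json
import os
import tempfile


def write_jsonl(samples, path):
    """Write one JSON object per line, storing each sample under "text"."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps({"text": sample}) + "\n")


def read_jsonl(path):
    """Read the "text" field back from every non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


# Hypothetical samples; newlines inside a sample are escaped by json.dumps,
# so each record still occupies exactly one line of the file.
samples = ["def add(a, b):\n    return a + b", "What is 2 + 2?"]
path = os.path.join(tempfile.mkdtemp(), "train_data.jsonl")
write_jsonl(samples, path)
loaded = read_jsonl(path)
```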

Highlighted Details

Quantifies rephrased samples relative to benchmarks like HumanEval, MATH Test, and MMLU.
Provides contamination percentages for various real-world datasets.
Includes scripts to reproduce F1 scores reported in the associated paper (Table 5 & 6).
Offers code for dataset preprocessing and end-to-end contamination detection.
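
A contamination percentage of the kind mentioned above can be computed as follows, assuming it is defined as the share of benchmark test samples flagged as having a rephrased counterpart in the training set. That definition and the `contamination_percentage` helper are assumptions made for illustration, and the flags below are made-up toy values, not results from the paper.

```python
def contamination_percentage(flags):
    """Percentage of test samples flagged as rephrased (True) in `flags`."""
    if not flags:
        return 0.0
    return 100.0 * sum(flags) / len(flags)


# One hypothetical boolean per benchmark test sample.
flags = [True, False, False, True]
pct = contamination_percentage(flags)
```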

Maintenance & Community

The project is associated with the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" by Shuo Yang et al. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

End-to-end execution requires an OpenAI API key, which incurs API costs and adds an external dependency. Detection quality depends on the chosen similarity metric and on parameters such as top_k.

Health Check

Last Commit

2 years ago

Responsiveness

Inactive

Star History

1 star in the last 30 days