llm-decontaminator  by lm-sys

LLM contamination detector for quantifying rephrased samples

created 1 year ago
306 stars

Project Summary

This repository provides tools to detect and mitigate data contamination in large language model (LLM) training datasets by identifying rephrased samples that overlap with benchmark datasets. It is designed for researchers and practitioners working with LLMs who need to ensure the integrity and reliability of their training data and evaluation results.

How It Works

The decontaminator compares each benchmark sample against the training set in two stages. First, an embedding similarity search retrieves the top_k most similar training samples for each benchmark sample. Second, an LLM (accessed through the OpenAI API) judges whether each retrieved pair is a rephrase of the same content. The flagged pairs give an estimate of the percentage of benchmark samples that have a rephrased counterpart in the training data, and the matching training samples can then be filtered out of the training set.
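A minimal, self-contained sketch of a retrieve-then-judge pipeline of this kind follows. It is not the repository's implementation: the bag-of-words `embed` function stands in for a real embedding model, `judge_rephrase` is a hypothetical placeholder for an LLM call (here a plain similarity threshold), and every function name is an assumption made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_candidates(test_text, train_texts, k=1):
    """Return the k training texts most similar to test_text, with scores."""
    test_vec = embed(test_text)
    scored = [(cosine(test_vec, embed(t)), t) for t in train_texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def judge_rephrase(test_text, train_text, score, threshold=0.5):
    """Hypothetical judge: a real pipeline would ask an LLM here.

    This placeholder only thresholds the similarity score.
    """
    return score >= threshold


def find_contaminated(test_texts, train_texts, k=1):
    """Return (test, train) pairs that the judge flags as rephrases."""
    flagged = []
    for test in test_texts:
        for score, train in top_k_candidates(test, train_texts, k):
            if judge_rephrase(test, train, score):
                flagged.append((test, train))
                break  # one match is enough to mark this test sample
    return flagged
```

Calling `find_contaminated(benchmark_texts, training_texts, k=1)` returns the flagged pairs; swapping the two placeholder functions for a real embedding model and a real LLM judge would leave the surrounding control flow unchanged.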

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n llm-detect python=3.9), activate it (conda activate llm-detect), and install dependencies (pip install -r requirement.txt).
  • Prerequisites: Python 3.9, datasets library, OpenAI API key (for end-to-end execution).
  • Data Format: Training and test sets should be in JSONL format, with each line containing {"text": data}.
  • Example: The README provides a script to load and preprocess data from the Hugging Face datasets library.
  • End-to-End Execution: python3 main.py --train_path <train_data.jsonl> --test_path <test_data.jsonl> --output_path <output_db.jsonl> --data-type code --top_k 1 (requires OPENAI_API_KEY environment variable).
  • Resources: Processing 500k samples from StarCoder-Data is demonstrated.
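The data format above can be produced with a few lines of standard-library Python. This is a sketch only: the helper names `write_jsonl` and `read_jsonl` are assumptions and are not part of the repository.

```python
import json


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each sample stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the "text" field back from each non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]
```

Files written this way can then be passed to main.py as the --train_path and --test_path arguments.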

Highlighted Details

  • Quantifies rephrased samples relative to benchmarks like HumanEval, MATH Test, and MMLU.
  • Provides contamination percentages for various real-world datasets.
  • Includes scripts to reproduce the F1 scores reported in the associated paper (Tables 5 and 6).
  • Offers code for dataset preprocessing and end-to-end contamination detection.
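To make the reported quantities concrete, the sketch below computes a contamination percentage and an F1 score from per-sample boolean flags. It assumes the percentage is taken over benchmark samples; the function names and the flag-list inputs are illustrative assumptions, not the repository's scripts or output format.

```python
def contamination_percentage(flags):
    """Percentage of benchmark samples flagged as having a rephrased counterpart."""
    if not flags:
        return 0.0
    return 100.0 * sum(bool(f) for f in flags) / len(flags)


def f1_score(predicted, actual):
    """F1 of predicted contamination flags against ground-truth flags."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    if tp == 0:
        return 0.0
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)
```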

Maintenance & Community

The project is associated with the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples" by Shuo Yang et al. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

End-to-end execution requires an OpenAI API key, which may incur costs and adds a dependency on an external service. Detection quality depends on the chosen similarity metric and on parameters such as top_k.
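A rough way to reason about API cost is sketched below, under the assumption that the pipeline makes one LLM judgment per retrieved (benchmark sample, training sample) pair. Both function names and the per-call price are hypothetical; actual pricing depends on the model and on prompt length.

```python
def estimate_llm_calls(num_test_samples, top_k):
    """Assumed call count: one judgment per retrieved candidate pair."""
    if num_test_samples < 0 or top_k < 1:
        raise ValueError("num_test_samples must be >= 0 and top_k >= 1")
    return num_test_samples * top_k


def estimate_cost(num_test_samples, top_k, price_per_call):
    """Hypothetical cost estimate from a flat, user-supplied price per call."""
    return estimate_llm_calls(num_test_samples, top_k) * price_per_call
```

Under this assumption, raising top_k widens the candidate set the judge sees for each benchmark sample, and the number of API calls grows linearly with it.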

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 90 days
