Static-to-Dynamic-LLMEval by SeekingDream

Survey on LLM benchmark evolution and data contamination

Created 1 year ago
545 stars

Top 58.5% on SourcePulse

View on GitHub
Project Summary

This repository surveys recent advances in Large Language Model (LLM) benchmarking, with a focus on the critical issue of data contamination. It charts the field's shift from static to dynamic evaluation methods, offering researchers and practitioners a guide to more reliable and trustworthy LLM performance assessment by mitigating the risks of training-data leakage.

How It Works

The project systematically analyzes methods for hardening static benchmarks against contamination and identifies their inherent limitations. It then highlights the lack of standardized criteria for dynamic benchmarks and proposes design principles to fill that gap. The approach is a comprehensive review and taxonomy of static and dynamic benchmarking strategies, covering data encryption, post-hoc contamination detection, and several dynamic generation techniques (temporal cutoff, rule-based, LLM-based, interactive, and multi-agent). This provides a structured overview to guide future research and standardization efforts in contamination-free LLM evaluation.
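To make the rule-based dynamic generation idea concrete, here is a minimal sketch (not from the repository; all function names are illustrative). The key property is that test items are sampled from a template at evaluation time, so the exact items cannot have leaked into any training corpus, while a fixed seed keeps a given benchmark instance reproducible:

```python
import random


def generate_arithmetic_item(rng: random.Random) -> dict:
    """Sample one fresh two-operand arithmetic question from a template.

    Because operands are drawn at evaluation time, the verbatim item is
    overwhelmingly unlikely to appear in a model's training data.
    """
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return {"question": f"What is {a} {op} {b}?", "answer": str(answer)}


def generate_benchmark(seed: int, n_items: int) -> list[dict]:
    """Produce a reproducible but previously unseen benchmark instance."""
    rng = random.Random(seed)
    return [generate_arithmetic_item(rng) for _ in range(n_items)]
```

Real rule-based benchmarks apply the same pattern to richer templates (code transformations, logic puzzles, perturbed reading-comprehension passages); the trade-off the survey discusses is that template-generated items can drift in difficulty from the static originals.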

Quick Start & Requirements

This repository serves as a survey and curated list of research papers, datasets, and code related to LLM benchmarking and data contamination. It does not provide a direct installation or quick-start command for a runnable tool. Links to papers and associated code are provided throughout the document for individual projects.

Highlighted Details

  • Comprehensive taxonomy of static and dynamic benchmarking methods for LLM evaluation.
  • Proposes optimal design principles for dynamic benchmarking to address standardization gaps.
  • Analyzes limitations of existing static and dynamic benchmarks.
  • Covers diverse application areas including math, coding, reasoning, safety, and language understanding.

Maintenance & Community

The repository is actively maintained, with a commitment to incorporating new research and welcoming community contributions via pull requests and issues. Suggestions for taxonomy updates or reporting new preprints are encouraged.

Licensing & Compatibility

The README does not specify a software license.

Limitations & Caveats

The survey highlights the inherent limitations of static benchmarks and the current lack of standardized criteria for dynamic benchmarks. It also discusses the limitations of existing dynamic benchmark approaches. The repository itself is a research survey, not a deployable tool with its own operational limitations.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 147 stars in the last 30 days
