SeekingDream: Survey on LLM benchmark evolution and data contamination
This repository surveys recent advancements in Large Language Model (LLM) benchmarking, focusing on the critical issue of data contamination. It traces the field's shift from static to dynamic evaluation methods, offering a guide for researchers and practitioners who want to improve the reliability and trustworthiness of LLM performance assessments by mitigating the risks of training data leakage.
How It Works
The project systematically analyzes methods for hardening static benchmarks and identifies their inherent limitations. It then highlights a critical gap, the lack of standardized criteria for dynamic benchmarks, and proposes design principles to fill it. The approach is a comprehensive review and taxonomy of static and dynamic benchmarking strategies, covering data encryption, post-hoc detection, and various dynamic generation techniques (e.g., temporal cutoff, rule-based, LLM-based, interactive, multi-agent). The result is a structured overview to guide future research and standardization efforts in contamination-free LLM evaluation.
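To make the post-hoc detection category concrete: one common family of methods checks how many n-grams from a benchmark item also appear verbatim in a training corpus, flagging items with high overlap as likely contaminated. The sketch below is a minimal illustration of that idea, not code from any surveyed paper; the function name and thresholds are hypothetical.

```python
def ngram_overlap(benchmark_text: str, training_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's word n-grams that occur verbatim
    in the training text. Higher values suggest possible contamination.

    Illustrative only: real detectors operate over tokenized corpora at
    scale and tune n and the decision threshold empirically.
    """
    def ngrams(text: str, n: int) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    bench = ngrams(benchmark_text, n)
    if not bench:  # item shorter than n words: nothing to compare
        return 0.0
    return len(bench & ngrams(training_text, n)) / len(bench)


# A benchmark item copied verbatim into training data scores 1.0;
# unrelated text scores near 0.0, so a threshold (e.g. > 0.5) flags leaks.
```

In practice this simple lexical check misses paraphrased leakage, which is one reason the survey's dynamic-generation techniques exist at all.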
Quick Start & Requirements
This repository serves as a survey and curated list of research papers, datasets, and code related to LLM benchmarking and data contamination. It does not provide a direct installation or quick-start command for a runnable tool. Links to papers and associated code are provided throughout the document for individual projects.
Maintenance & Community
The repository is actively maintained, with a commitment to incorporating new research and welcoming community contributions via pull requests and issues. Suggestions for taxonomy updates or reporting new preprints are encouraged.
Licensing & Compatibility
The README does not specify a software license.
Limitations & Caveats
The survey highlights the inherent limitations of static benchmarks and the current lack of standardized criteria for dynamic benchmarks, and it discusses the shortcomings of existing dynamic benchmark approaches. The repository itself is a research survey, not a deployable tool, so it has no operational limitations of its own.