Static-to-Dynamic-LLMEval by SeekingDream

Survey on LLM benchmark evolution and data contamination

Created 1 year ago
545 stars

Top 58.5% on SourcePulse

View on GitHub
Project Summary

This repository surveys recent advances in Large Language Model (LLM) benchmarking, with a focus on the critical issue of data contamination. It charts the field's shift from static to dynamic evaluation methods, offering researchers and practitioners a guide to more reliable and trustworthy LLM performance assessment by mitigating the risks of training-data leakage.

How It Works

The project systematically analyzes methods for hardening static benchmarks against contamination and identifies their inherent limitations. It then highlights the lack of standardized criteria for dynamic benchmarks and proposes design principles to fill that gap. The approach is a comprehensive review and taxonomy of static and dynamic benchmarking strategies, covering data encryption, post-hoc contamination detection, and several dynamic generation techniques (temporal cutoff, rule-based, LLM-based, interactive, and multi-agent). This provides a structured overview to guide future research and standardization efforts in contamination-free LLM evaluation.
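To make the rule-based dynamic generation idea concrete, here is a minimal sketch (not from the repository; all function names are illustrative). The key property is that test items are sampled from a template at evaluation time, so the exact items cannot have leaked into any training corpus, while a fixed seed keeps a given benchmark instance reproducible:

```python
import random


def generate_arithmetic_item(rng: random.Random) -> dict:
    """Sample one fresh two-operand arithmetic question from a template.

    Because operands are drawn at evaluation time, the verbatim item is
    overwhelmingly unlikely to appear in a model's training data.
    """
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return {"question": f"What is {a} {op} {b}?", "answer": str(answer)}


def generate_benchmark(seed: int, n_items: int) -> list[dict]:
    """Produce a reproducible but previously unseen benchmark instance."""
    rng = random.Random(seed)
    return [generate_arithmetic_item(rng) for _ in range(n_items)]
```

Real rule-based benchmarks apply the same pattern to richer templates (code transformations, logic puzzles, perturbed reading-comprehension passages); the trade-off the survey discusses is that template-generated items can drift in difficulty from the static originals.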

Quick Start & Requirements

This repository serves as a survey and curated list of research papers, datasets, and code related to LLM benchmarking and data contamination. It does not provide a direct installation or quick-start command for a runnable tool. Links to papers and associated code are provided throughout the document for individual projects.

Highlighted Details

  • Comprehensive taxonomy of static and dynamic benchmarking methods for LLM evaluation.
  • Proposes optimal design principles for dynamic benchmarking to address standardization gaps.
  • Analyzes limitations of existing static and dynamic benchmarks.
  • Covers diverse application areas including math, coding, reasoning, safety, and language understanding.

Maintenance & Community

The repository is actively maintained, with a commitment to incorporating new research and welcoming community contributions via pull requests and issues. Suggestions for taxonomy updates or reporting new preprints are encouraged.

Licensing & Compatibility

The README does not specify a software license.

Limitations & Caveats

The survey highlights the inherent limitations of static benchmarks and the current lack of standardized criteria for dynamic benchmarks. It also discusses the limitations of existing dynamic benchmark approaches. The repository itself is a research survey, not a deployable tool with its own operational limitations.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 147 stars in the last 30 days
