Dynamic LLM evaluation suite for accurate, cost-effective benchmarking
Summary
MixEval offers a dynamic, ground-truth-based benchmark suite for evaluating Large Language Models (LLMs). It addresses the limitations of static, expensive, and potentially contaminated benchmarks by providing a cost-effective, reproducible, and continuously updated evaluation framework. Designed for researchers and practitioners, MixEval achieves highly accurate model ranking, correlating strongly with human preference benchmarks like Chatbot Arena, while significantly reducing evaluation time and cost.
How It Works
The core of MixEval is its dynamic benchmarking approach: existing LLM benchmarks are blended with real-world user queries mined from the web, and the mixture is periodically refreshed by a fast, stable pipeline to mitigate contamination and keep the queries relevant. Grading uses model parsers (typically GPT-3.5-Turbo, with open-source LLMs also supported), which are more reliable than traditional rule-based answer matching. The suite comprises MixEval and the harder MixEval-Hard, each in free-form and multiple-choice formats, yielding a comprehensive and less biased query distribution.
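To make the model-parser idea concrete, here is a minimal sketch of LLM-based grading of a free-form answer against a ground-truth reference. This is not MixEval's actual prompt or grading code; the prompt wording, score scale, and helper name are illustrative assumptions.

```python
# Minimal sketch of model-parser grading: an LLM judges whether a free-form
# answer matches the ground-truth reference. Not MixEval's actual code.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def parse_score(question: str, reference: str, answer: str) -> float:
    """Ask GPT-3.5-Turbo to rate answer correctness against the reference (0.0 to 1.0)."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "On a scale of 0.0 to 1.0, how correct is the model answer? "
        "Reply with a number only."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging for stable scores
    )
    # Assumes the judge complies with the number-only instruction;
    # production code would validate or retry on malformed output.
    return float(resp.choices[0].message.content.strip())

print(parse_score("What is the capital of France?", "Paris", "It's Paris."))
```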
Quick Start & Requirements
Installation involves cloning the repository, setting up a Python 3.11 environment via Conda, and running setup.sh. An OpenAI API key is required for the default model parser, though open-source parsers are also supported. Evaluation is launched via a Python command that specifies the model, benchmark, version, and resource allocation (e.g., --batch_size, --max_gpu_memory). Links to the homepage, leaderboard, and arXiv paper are provided.
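A representative session might look like the following. The repository URL, environment-variable name, module path, and flag values other than --batch_size and --max_gpu_memory are assumptions and may differ from the actual README.

```bash
# Clone the repository and set up the Conda environment (Python 3.11).
git clone https://github.com/Psycoy/MixEval.git   # assumed repository URL
cd MixEval
conda create -n MixEval python=3.11 -y
conda activate MixEval
bash setup.sh

# The default model parser calls the OpenAI API; the variable name is an assumption.
export MODEL_PARSER_API=<your_openai_api_key>

# Launch an evaluation. Only --batch_size and --max_gpu_memory are named above;
# the module path and remaining flags are illustrative.
python -m mix_eval.evaluate \
    --model_name <model_to_evaluate> \
    --benchmark mixeval \
    --version <benchmark_version> \
    --batch_size 20 \
    --max_gpu_memory 5GiB
```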
Highlighted Details
Maintenance & Community
The project is actively maintained; recent news notes support for local model parsers and the release of MixEval-X. The work was accepted to NeurIPS 2024. Notable contributors are listed, and links to the project's homepage, blog, and Twitter are available.
Licensing & Compatibility
The repository's license is not explicitly stated in the provided README. The suite is designed for compatibility with both open-source and proprietary LLMs and allows users to integrate their own evaluation code.
Limitations & Caveats
A primary caveat is the lack of explicit licensing information, potentially hindering commercial adoption. The default model parser relies on an OpenAI API key, introducing an external dependency and associated costs.