MixEval by JinjieNi

Dynamic LLM evaluation suite for accurate, cost-effective benchmarking

Created 1 year ago
250 stars

Top 100.0% on SourcePulse

View on GitHub
Summary

MixEval offers a dynamic, ground-truth-based benchmark suite for evaluating Large Language Models (LLMs). It addresses the limitations of static, expensive, and potentially contaminated benchmarks by providing a cost-effective, reproducible, and continuously updated evaluation framework. Designed for researchers and practitioners, MixEval achieves highly accurate model ranking, correlating strongly with human preference benchmarks like Chatbot Arena, while significantly reducing evaluation time and cost.

How It Works

The core of MixEval is its dynamic benchmarking approach, which blends existing LLM benchmarks with real-world user queries mined from the web. This mixture is periodically refreshed through a fast, stable pipeline to mitigate contamination and keep the queries relevant. Grading relies on stable model parsers, typically GPT-3.5-Turbo or open-source LLMs, which are more reliable than traditional rule-based parsing. The suite includes MixEval and MixEval-Hard, each available in free-form and multiple-choice formats, designed to provide a comprehensive and less biased query distribution.

Quick Start & Requirements

Installation involves cloning the repository, creating a Python 3.11 environment via Conda, and running setup.sh. An OpenAI API key is required for the default model parser, though open-source parsers are supported. Evaluation is launched with a single Python command specifying the model, benchmark, benchmark version, and resource allocation (e.g., --batch_size, --max_gpu_memory); an example invocation is sketched below. Links to the homepage, leaderboard, and arXiv paper are provided.
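
A minimal setup-and-run sketch of the steps above. The repository URL, the mix_eval.evaluate entry point, and any flags beyond --batch_size and --max_gpu_memory are assumptions, and the model, benchmark, and version values are placeholders, so verify the exact commands and API-key setup against the repository README.

```bash
# Clone the repository and build the Python 3.11 Conda environment (URL assumed).
git clone https://github.com/Psycoy/MixEval.git
cd MixEval
conda create -n MixEval python=3.11 -y
conda activate MixEval
bash setup.sh

# The default model parser calls the OpenAI API; the exact key variable or
# .env entry it reads is documented in the repository README.
export OPENAI_API_KEY="sk-..."

# Run an evaluation (entry point and non-documented flags are assumptions).
python -m mix_eval.evaluate \
    --model_name <registered_model_name> \
    --benchmark mixeval_hard \
    --version 2024-06-01 \
    --batch_size 16 \
    --max_gpu_memory 48GiB
```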

Highlighted Details

  • Achieves a 0.96 correlation with Chatbot Arena Elo, indicating accurate model ranking.
  • Offers significant cost and time savings, estimated at 6% of MMLU evaluation time and a fraction of Chatbot Arena's cost.
  • Features dynamic data updates, with queries refreshed monthly to prevent contamination.
  • Supports evaluation of local checkpoints and custom model registration (see the sketch after this list).
  • MixEval-X, an any-to-any benchmark, has been released.

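For the local-checkpoint support noted above, an invocation might look like the following sketch; the local_chat model name and the --model_path flag are assumptions not confirmed by this summary and should be checked against the repository README.

```bash
# Hypothetical: evaluate a local checkpoint instead of a registered model.
# --model_name local_chat and --model_path are assumed flag values.
python -m mix_eval.evaluate \
    --model_name local_chat \
    --model_path /path/to/your/checkpoint \
    --benchmark mixeval \
    --version 2024-06-01 \
    --batch_size 16 \
    --max_gpu_memory 48GiB
```
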
Maintenance & Community

The project is actively maintained, with recent news highlighting support for local model parsers and the release of MixEval-X. The work was accepted to NeurIPS 2024. Notable contributors are listed, and links to the project's homepage, blog, and Twitter are available.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README. It is designed for compatibility with both open-source and proprietary LLMs, and allows users to integrate their own evaluation code.

Limitations & Caveats

A primary caveat is the lack of explicit licensing information, potentially hindering commercial adoption. The default model parser relies on an OpenAI API key, introducing an external dependency and associated costs.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days
