LLM benchmark for evaluating models on recently released data
LiveBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on a diverse set of tasks while actively mitigating test set contamination. It targets researchers and developers seeking objective, automated evaluation of LLM performance across reasoning, coding, math, and other domains. The benchmark's monthly question releases and reliance on objective ground truths enable robust and reliable performance measurement.
How It Works
LiveBench releases new questions monthly, drawing on recent arXiv papers, news articles, and IMDb data to deter test set contamination. Each question has a verifiable, objective ground-truth answer, eliminating the need for subjective LLM judges. The benchmark currently comprises 18 tasks across 6 categories, with more challenging tasks planned over time.
Quick Start & Requirements
- Install for API models: pip install -e .
- Install for local GPU inference: pip install -e .[flash_attn] (run pip uninstall fschat first)
- Run the benchmark: python run_livebench.py --model <model_name> --bench-name <subset>
- Display results: python show_livebench_result.py --bench-name <subset> --model-list <models>
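For example, the following sketch evaluates a single model on one subset and then prints its scores. The model name gpt-4o-mini and the subset name live_bench are placeholders, not values prescribed by the project; substitute whatever model and subset you actually want to run.

```bash
# Install for API-based models, evaluate one model on a subset,
# then display its scores. Model and subset names are placeholders.
pip install -e .
python run_livebench.py --model gpt-4o-mini --bench-name live_bench
python show_livebench_result.py --bench-name live_bench --model-list gpt-4o-mini
```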
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Local model inference via gen_model_answer.py is unmaintained. Some models may encounter errors due to content filters, which are treated as incorrect responses. Evaluating the full benchmark requires passing --livebench-release-option 2024-11-25 to access the most recent public questions.
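A sketch of such a full run, assuming the flag is passed to run_livebench.py (the text above does not state which script accepts it) and again using a placeholder model name:

```bash
# Full-benchmark run against the most recent public question release.
# Passing the flag to run_livebench.py is an assumption; the model name
# is a placeholder.
python run_livebench.py --model gpt-4o-mini --bench-name live_bench \
    --livebench-release-option 2024-11-25
```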