LLM benchmark for evaluating models on recently released data
LiveBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on a diverse set of tasks while actively mitigating test set contamination. It targets researchers and developers seeking objective, automated evaluation of LLM performance across reasoning, coding, math, and other domains. The benchmark's monthly question releases and reliance on objective ground truths enable robust and reliable performance measurement.
How It Works
LiveBench releases new questions monthly, drawing on recent arXiv papers, news articles, and IMDb data to deter test set contamination. Each question has a verifiable, objective ground-truth answer, eliminating the need for subjective LLM judges. The benchmark currently comprises 18 tasks across 6 categories, with more challenging tasks planned over time.
Quick Start & Requirements
- Install for API models: pip install -e .
- Install for local GPU inference: pip install -e .[flash_attn] (run pip uninstall fschat first)
- Run the benchmark: python run_livebench.py --model <model_name> --bench-name <subset>
- Display results: python show_livebench_result.py --bench-name <subset> --model-list <models>
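For example, the following sketch evaluates a single model on one subset and then prints its scores. The model name gpt-4o-mini and the subset name live_bench are placeholders, not values prescribed by the project; substitute whatever model and subset you actually want to run.

```bash
# Install for API-based models, evaluate one model on a subset,
# then display its scores. Model and subset names are placeholders.
pip install -e .
python run_livebench.py --model gpt-4o-mini --bench-name live_bench
python show_livebench_result.py --bench-name live_bench --model-list gpt-4o-mini
```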
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Local model inference via gen_model_answer.py is unmaintained. Some models may encounter errors due to content filters, which are treated as incorrect responses. Evaluating the full benchmark requires passing --livebench-release-option 2024-11-25 to access the most recent public questions.
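A sketch of such a full run, assuming the flag is passed to run_livebench.py (the text above does not state which script accepts it) and again using a placeholder model name:

```bash
# Full-benchmark run against the most recent public question release.
# Passing the flag to run_livebench.py is an assumption; the model name
# is a placeholder.
python run_livebench.py --model gpt-4o-mini --bench-name live_bench \
    --livebench-release-option 2024-11-25
```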