LiveBench by LiveBench

LLM benchmark for evaluating models on recently released data

created 1 year ago
836 stars

Top 43.5% on sourcepulse

Project Summary

LiveBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on a diverse set of tasks while actively mitigating test set contamination. It targets researchers and developers seeking objective, automated evaluation of LLM performance across reasoning, coding, math, and other domains. The benchmark's monthly question releases and reliance on objective ground truths enable robust and reliable performance measurement.

How It Works

LiveBench releases new questions monthly, drawing on recently published arXiv papers, news articles, and IMDb data to deter test set contamination. Each question has a verifiable, objective ground-truth answer, eliminating the need for subjective LLM judging. The benchmark currently comprises 18 tasks across 6 categories, with harder tasks planned for release over time.

Quick Start & Requirements

  • Install: pip install -e . (for API models) or pip install -e .[flash_attn] (for local GPU inference, requires pip uninstall fschat first).
  • Local model inference is unmaintained; the recommended path is to serve local models with vLLM behind an OpenAI-compatible API.
  • Evaluation: python run_livebench.py --model <model_name> --bench-name <subset> (see the example sketch after this list).
  • View results: python show_livebench_result.py --bench-name <subset> --model-list <models>
  • Documentation: https://github.com/LiveBench/LiveBench
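
A minimal end-to-end run might look like the sketch below; the model name gpt-4o-mini and the subset live_bench/coding are illustrative placeholders, not values prescribed by the project:

    # install dependencies for API-based evaluation
    pip install -e .

    # generate and score answers for one category (placeholder model and subset names)
    python run_livebench.py --model gpt-4o-mini --bench-name live_bench/coding

    # display the aggregated results for that category
    python show_livebench_result.py --bench-name live_bench/coding --model-list gpt-4o-mini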

Highlighted Details

  • Monthly question releases to combat data contamination.
  • Objective, automatically verifiable ground-truth answers.
  • 18 diverse tasks across 6 categories (Reasoning, Math, Coding, Language, Data Analysis, Instruction Following).
  • Supports evaluation of API-based models and local models via OpenAI-compatible endpoints.

Maintenance & Community

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

Local model inference via gen_model_answer.py is unmaintained. Some models may return errors triggered by content filters; such responses are scored as incorrect. Evaluating the full benchmark requires passing --livebench-release-option 2024-11-25 to access the most recent public questions.
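
For example, a run against the newest public question set might look like the following sketch (gpt-4o-mini is a placeholder model name):

    python run_livebench.py --model gpt-4o-mini --livebench-release-option 2024-11-25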

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 16

Star History

144 stars in the last 90 days

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Travis Fischer (Founder of Agentic).

Explore Similar Projects

LiveCodeBench by LiveCodeBench

0.8%
606
Benchmark for holistic LLM code evaluation
created 1 year ago
updated 2 weeks ago
Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Simon Willison (Author of Django), and 9 more.

simple-evals by openai

0.4%
4k
Lightweight library for evaluating language models
created 1 year ago
updated 3 weeks ago