LiveBench by LiveBench

LLM benchmark for evaluating models on recently released data

Created 1 year ago
873 stars

Top 41.1% on SourcePulse

View on GitHub
Project Summary

LiveBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on a diverse set of tasks while actively mitigating test set contamination. It targets researchers and developers seeking objective, automated evaluation of LLM performance across reasoning, coding, math, and other domains. The benchmark's monthly question releases and reliance on objective ground truths enable robust and reliable performance measurement.

How It Works

LiveBench releases new questions monthly, drawing on recent arXiv papers, news articles, and IMDb data to deter contamination. Each question has a verifiable, objective ground-truth answer, eliminating the need for subjective LLM judges. The benchmark currently comprises 18 tasks across 6 categories, with harder tasks to be added over time.

Quick Start & Requirements

  • Install: pip install -e . (for API models) or pip install -e .[flash_attn] (for local GPU inference, requires pip uninstall fschat first).
  • Local model inference via the bundled scripts is unmaintained; the maintainers recommend serving local models with vLLM behind an OpenAI-compatible API instead (see the example sketch after this list).
  • Evaluation: python run_livebench.py --model <model_name> --bench-name <subset>
  • View results: python show_livebench_result.py --bench-name <subset> --model-list <models>
  • Documentation: https://github.com/LiveBench/LiveBench
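
A minimal end-to-end sketch, assuming a local model served with vLLM. The vllm serve command is standard vLLM usage; the --api-base flag, the live_bench/coding subset name, and the model names are assumptions for illustration and may differ from the repository's current CLI:

  # Install for API-model evaluation (run from a clone of the repository)
  pip install -e .

  # Serve a local model behind an OpenAI-compatible endpoint with vLLM
  vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

  # Evaluate a benchmark subset against that endpoint
  # (--api-base and the live_bench/coding subset name are assumed, not verified)
  python run_livebench.py --model Meta-Llama-3.1-8B-Instruct \
      --bench-name live_bench/coding \
      --api-base http://localhost:8000/v1

  # Aggregate and display scores for the chosen models
  python show_livebench_result.py --bench-name live_bench/coding \
      --model-list Meta-Llama-3.1-8B-Instruct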

Highlighted Details

  • Monthly question releases to combat data contamination.
  • Objective, automatically verifiable ground-truth answers.
  • 18 diverse tasks across 6 categories (Reasoning, Math, Coding, Language, Data Analysis, Instruction Following).
  • Supports evaluation of API-based models and local models via OpenAI-compatible endpoints.

Maintenance & Community

Licensing & Compatibility

  • The README does not explicitly state a license.

Limitations & Caveats

Local model inference via gen_model_answer.py is unmaintained. Some models may encounter errors due to content filters, which are treated as incorrect responses. Evaluating the full benchmark requires passing --livebench-release-option 2024-11-25 to access the most recent public questions.
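
As a hedged illustration, a full-benchmark run would combine that flag with the evaluation command from Quick Start (the model name here is illustrative):

  # Evaluate against all publicly released questions through 2024-11-25
  python run_livebench.py --model gpt-4o --livebench-release-option 2024-11-25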

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 7
  • Star History: 25 stars in the last 30 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Edward Z. Yang (Research Engineer at Meta; Maintainer of PyTorch), and 5 more.

yet-another-applied-llm-benchmark by carlini

Top 0.2% on SourcePulse
1k stars
LLM benchmark for evaluating models on previously asked programming questions
Created 1 year ago
Updated 4 months ago