WildBench by allenai

Benchmarking LLMs with challenging real-user tasks

Created 2 years ago
251 stars

Top 99.9% on SourcePulse

Summary

WildBench provides a benchmark for evaluating Large Language Models (LLMs) using challenging, real-world user prompts. It targets researchers and practitioners needing to assess LLM capabilities beyond standard academic datasets, offering a framework for robust, interpretable, and less biased performance measurement.

How It Works

The framework employs task-specific checklists generated by advanced LLMs (GPT-4-turbo, Claude-3-Opus) to guide evaluation. Responses are judged either individually via WB Score (each response rated on a 1-10 scale, then rescaled for the leaderboard) or pairwise via WB Reward, where GPT-4-turbo judges a model's response against a baseline response. A length penalty mechanism mitigates the judge's bias toward longer outputs, improving metric reliability.
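To make the metrics concrete, here is a minimal Python sketch of both. The rescaling formula ((s - 5) * 20), the reward values (±1, ±0.5, 0), and the character margin K follow the general scheme described in the WildBench paper, but the exact constants and verdict labels here are illustrative assumptions, not the repository's canonical implementation.

    # Illustrative sketch of WildBench's metrics (assumed constants).

    def wb_score(raw_scores, rescale=True):
        """Mean of per-task judge scores on a 1-10 scale.

        Assumes the leaderboard rescaling (s - 5) * 20, which maps
        1..10 onto roughly -80..100.
        """
        mean = sum(raw_scores) / len(raw_scores)
        return (mean - 5) * 20 if rescale else mean

    def wb_reward(verdict, len_a, len_b, k=500):
        """Pairwise reward for model A against baseline B with a length penalty.

        verdict: a judge label such as "A++" (A much better), "A+", "tie",
        "B+", or "B++" (hypothetical labels). If the winning side's response
        is more than k characters longer than the loser's, the win is
        downgraded to a tie to counter length bias.
        """
        rewards = {"A++": 1.0, "A+": 0.5, "tie": 0.0, "B+": -0.5, "B++": -1.0}
        r = rewards[verdict]
        if r > 0 and len_a - len_b > k:    # A won but is much longer -> tie
            r = 0.0
        elif r < 0 and len_b - len_a > k:  # B won but is much longer -> tie
            r = 0.0
        return r

    # Example: A wins, but its response is 800 characters longer than B's.
    print(wb_reward("A+", len_a=2300, len_b=1500))  # 0.0 (win becomes a tie)
    print(wb_score([7, 8, 9]))                      # 60.0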

Quick Start & Requirements

  • Installation: conda create -n zeroeval python=3.10 && conda activate zeroeval && pip install vllm==0.5.1 && pip install -r requirements.txt.
  • Prerequisites: Python 3.10, vLLM (v0.5.1), CUDA recommended for GPU inference.
  • Running Models: Use scripts/_common_vllm.sh [hf_model_id] [model_pretty_name] [num_gpus] for vLLM-supported models.
  • Evaluation: Primarily uses OpenAI's Batch Mode (evaluation/run_score_eval_batch.sh); a sketch of the underlying Batch API flow follows this list.
  • Links: Paper available at https://arxiv.org/abs/2406.04770. Detailed evaluation steps in EVAL.md.
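
For reference, the batch-mode evaluation submits pre-built judge prompts through OpenAI's Batch API. A minimal sketch using the official openai Python client is below; the input file name and request contents are hypothetical, and evaluation/run_score_eval_batch.sh may wrap this flow differently.

    # Hedged sketch: submitting judge requests via OpenAI's Batch API.
    # "judge_requests.jsonl" is a hypothetical file with one request per line,
    # each shaped like {"custom_id": ..., "method": "POST",
    #                   "url": "/v1/chat/completions", "body": {...}}.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Upload the batch input file.
    batch_file = client.files.create(
        file=open("judge_requests.jsonl", "rb"),
        purpose="batch",
    )

    # Create the batch job; results are returned within the completion window.
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) later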

Highlighted Details

  • Maintains a public leaderboard of LLM performance.
  • WB Score and WB Reward metrics are designed for interpretability.
  • WB Reward-Mix demonstrates high correlation with human preferences, as measured by Chatbot Arena Elo ratings.
  • Supports evaluation via OpenAI Batch API for efficiency.

Maintenance & Community

Developed by Allen AI, the project encourages community contributions via GitHub Issues for adding new models. Changelog-style update lists for the code and leaderboard indicate ongoing development.

Licensing & Compatibility

The README does not explicitly state an open-source license. Suitability for commercial use or closed-source linking requires clarification of the licensing terms.

Limitations & Caveats

The primary evaluation pipeline depends on vLLM compatibility and OpenAI's Batch API, which incurs API costs and potential turnaround delays. Models that vLLM does not support must fall back on an alternative engine (--engine hf or --engine openai). Ongoing development means some features or model integrations may still be in progress.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Nir Gazit (Cofounder of Traceloop), Jared Palmer (SVP at GitHub; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

  • haven by redotvideo — LLM fine-tuning and evaluation platform. 348 stars. Created 2 years ago; updated 2 years ago.