WildBench by allenai

Benchmarking LLMs with challenging real-user tasks

Created 2 years ago
251 stars

Top 99.9% on SourcePulse

Summary

WildBench provides a benchmark for evaluating Large Language Models (LLMs) using challenging, real-world user prompts. It targets researchers and practitioners needing to assess LLM capabilities beyond standard academic datasets, offering a framework for robust, interpretable, and less biased performance measurement.

How It Works

The framework employs task-specific checklists generated by advanced LLMs (GPT-4-turbo, Claude-3-Opus) to guide evaluation. Responses are judged either individually via WB Score (each response rated on a 1-10 scale, then rescaled for the leaderboard) or pairwise via WB Reward, where GPT-4-turbo judges a model's response against a baseline response. A length penalty mechanism mitigates the judge's bias toward longer outputs, improving metric reliability.
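To make the metrics concrete, here is a minimal Python sketch of both. The rescaling formula ((s - 5) * 20), the reward values (±1, ±0.5, 0), and the character margin K follow the general scheme described in the WildBench paper, but the exact constants and verdict labels here are illustrative assumptions, not the repository's canonical implementation.

    # Illustrative sketch of WildBench's metrics (assumed constants).

    def wb_score(raw_scores, rescale=True):
        """Mean of per-task judge scores on a 1-10 scale.

        Assumes the leaderboard rescaling (s - 5) * 20, which maps
        1..10 onto roughly -80..100.
        """
        mean = sum(raw_scores) / len(raw_scores)
        return (mean - 5) * 20 if rescale else mean

    def wb_reward(verdict, len_a, len_b, k=500):
        """Pairwise reward for model A against baseline B with a length penalty.

        verdict: a judge label such as "A++" (A much better), "A+", "tie",
        "B+", or "B++" (hypothetical labels). If the winning side's response
        is more than k characters longer than the loser's, the win is
        downgraded to a tie to counter length bias.
        """
        rewards = {"A++": 1.0, "A+": 0.5, "tie": 0.0, "B+": -0.5, "B++": -1.0}
        r = rewards[verdict]
        if r > 0 and len_a - len_b > k:    # A won but is much longer -> tie
            r = 0.0
        elif r < 0 and len_b - len_a > k:  # B won but is much longer -> tie
            r = 0.0
        return r

    # Example: A wins, but its response is 800 characters longer than B's.
    print(wb_reward("A+", len_a=2300, len_b=1500))  # 0.0 (win becomes a tie)
    print(wb_score([7, 8, 9]))                      # 60.0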

Quick Start & Requirements

  • Installation: conda create -n zeroeval python=3.10 && conda activate zeroeval && pip install vllm==0.5.1 && pip install -r requirements.txt.
  • Prerequisites: Python 3.10, vLLM (v0.5.1), CUDA recommended for GPU inference.
  • Running Models: Use scripts/_common_vllm.sh [hf_model_id] [model_pretty_name] [num_gpus] for vLLM-supported models.
  • Evaluation: Primarily uses OpenAI's Batch Mode (evaluation/run_score_eval_batch.sh); a sketch of the underlying Batch API flow follows this list.
  • Links: Paper available at https://arxiv.org/abs/2406.04770. Detailed evaluation steps in EVAL.md.
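
For reference, the batch-mode evaluation submits pre-built judge prompts through OpenAI's Batch API. A minimal sketch using the official openai Python client is below; the input file name and request contents are hypothetical, and evaluation/run_score_eval_batch.sh may wrap this flow differently.

    # Hedged sketch: submitting judge requests via OpenAI's Batch API.
    # "judge_requests.jsonl" is a hypothetical file with one request per line,
    # each shaped like {"custom_id": ..., "method": "POST",
    #                   "url": "/v1/chat/completions", "body": {...}}.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Upload the batch input file.
    batch_file = client.files.create(
        file=open("judge_requests.jsonl", "rb"),
        purpose="batch",
    )

    # Create the batch job; results are returned within the completion window.
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) later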

Highlighted Details

  • Maintains a public leaderboard of LLM performance.
  • WB Score and WB Reward metrics are designed for interpretability.
  • WB Reward-Mix demonstrates high correlation with human preferences, as measured by Chatbot Arena Elo ratings.
  • Supports evaluation via OpenAI Batch API for efficiency.

Maintenance & Community

Developed by Allen AI, the project encourages community contributions via GitHub Issues for adding new models. Changelog-style update lists for the code and leaderboard indicate ongoing development.

Licensing & Compatibility

The README does not explicitly state an open-source license. Suitability for commercial use or closed-source linking requires clarification of the licensing terms.

Limitations & Caveats

The primary evaluation pipeline depends on vLLM compatibility and OpenAI's Batch API, which incurs API costs and potential turnaround delays. Models that vLLM does not support must fall back on an alternative engine (--engine hf or --engine openai). Ongoing development means some features or model integrations may still be in progress.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 30 days

Explore Similar Projects

Starred by Nir Gazit (Cofounder of Traceloop), Jared Palmer (SVP at GitHub; Founder of Turborepo; Author of Formik, TSDX), and 3 more.

  • haven by redotvideo — LLM fine-tuning and evaluation platform. 348 stars. Created 2 years ago; updated 2 years ago.