WildBench (allenai): Benchmarking LLMs with challenging real-user tasks
Summary
WildBench provides a benchmark for evaluating Large Language Models (LLMs) using challenging, real-world user prompts. It targets researchers and practitioners needing to assess LLM capabilities beyond standard academic datasets, offering a framework for robust, interpretable, and less biased performance measurement.
How It Works
The framework employs checklists generated by advanced LLMs (GPT-4-turbo, Claude-3-Opus) for task-specific evaluation. Responses are scored using a WB Score (1-10 scale, re-scaled) or compared via WB Reward, where GPT-4-turbo judges relative performance. A novel length penalty mechanism mitigates judge bias towards longer outputs, enhancing metric reliability.
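The two metrics above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual code. The summary does not spell out the formulas, so the following are assumptions: the re-scaling `(score - 5) * 2`, the five-level verdict labels and their reward values, the 500-character threshold `k`, and the function names `rescale_wb_score` and `wb_reward`.

```python
from statistics import mean

# Assumed mapping from the judge's pairwise verdict to a reward for model A.
# The labels and values are illustrative, not taken from the source.
REWARD = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def rescale_wb_score(raw_score: float) -> float:
    """Re-scale a 1-10 judge score so that 5 maps to 0 (assumed formula)."""
    return (raw_score - 5.0) * 2.0


def wb_reward(verdict: str, len_a: int, len_b: int, k: int = 500) -> float:
    """Reward for model A against model B, with a length penalty.

    A slight win is downgraded to a tie when the winning response is more
    than k characters longer than the losing one (k is an assumed default).
    Clear wins and losses are left unchanged.
    """
    if verdict == "slightly_better" and len_a - len_b > k:
        return REWARD["tie"]
    if verdict == "slightly_worse" and len_b - len_a > k:
        return REWARD["tie"]
    return REWARD[verdict]


if __name__ == "__main__":
    # Average re-scaled score over three hypothetical judge scores.
    print(mean(rescale_wb_score(s) for s in (7, 8, 6)))
    # A slight win by a much longer answer counts as a tie.
    print(wb_reward("slightly_better", len_a=1200, len_b=400))
```

Under these assumptions, only marginal wins are affected by the penalty, so a response that is clearly better still earns its full reward regardless of length.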
Quick Start & Requirements
Installation: conda create -n zeroeval python=3.10 && conda activate zeroeval && pip install vllm==0.5.1 && pip install -r requirements.txt
Inference: scripts/_common_vllm.sh [hf_model_id] [model_pretty_name] [num_gpus] for vLLM-supported models.
Evaluation: evaluation/run_score_eval_batch.sh
Paper: https://arxiv.org/abs/2406.04770. Detailed evaluation steps are in EVAL.md.
Highlighted Details
Maintenance & Community
Developed by Allen AI, the project encourages community contributions via GitHub Issues for adding new models. The README maintains update lists for code and leaderboard features, which indicates ongoing development.
Licensing & Compatibility
The specific open-source license is not explicitly stated in the provided README content. Compatibility for commercial use or closed-source linking requires clarification on licensing terms.
Limitations & Caveats
The primary evaluation pipeline relies on vLLM compatibility and OpenAI's batch API, incurring costs and potential delays. Support for models not compatible with vLLM requires custom engine implementation (--engine hf, --engine openai). Ongoing development means some features or model integrations may still be in progress.
1 year ago
Inactive