LLM benchmark for evaluating creativity and open-ended question answering
AidanBench is an open-ended question benchmark designed to stress-test the creativity, reliability, and contextual attention of Large Language Models (LLMs). It targets researchers and developers evaluating LLMs beyond traditional accuracy metrics, offering a method to penalize mode collapse and inflexibility in generative AI.
How It Works
AidanBench evaluates LLMs by posing open-ended questions across diverse domains and prompting the model to generate novel answers while avoiding repetition. The benchmark continues generating responses until they become incoherent (coherence score $\leq$ 15/100) or too semantically similar to previous outputs (novelty score $\leq$ 0.15). Scoring is based on the maximum number of valid responses generated per question, with coherence assessed by a judge model (o1-mini) and novelty calculated using response embeddings.
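The sketch below illustrates the loop described above; it is not the repository's implementation. The helper prompts, the model names for the model under test and the embedding model, and the exact novelty formula (assumed here to be 1 minus the maximum cosine similarity to earlier answers) are assumptions for illustration.

```python
# Illustrative sketch of the AidanBench-style loop, not the repository's code.
import numpy as np
from openai import OpenAI

client = OpenAI()

COHERENCE_CUTOFF = 15    # stop when the judge scores an answer <= 15/100
NOVELTY_CUTOFF = 0.15    # stop when an answer is too similar to earlier ones


def ask_model(question: str, previous: list[str]) -> str:
    # Model under test; model name and prompt wording are illustrative only.
    prompt = question + "\nGive an answer different from these:\n" + "\n".join(previous)
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content


def judge_coherence(question: str, answer: str) -> int:
    # Judge model (AidanBench uses o1-mini); prompt wording is illustrative.
    prompt = (
        f"Rate the coherence of this answer to '{question}' from 0 to 100. "
        f"Reply with a number only.\n{answer}"
    )
    out = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(out.choices[0].message.content.strip())


def embed(text: str) -> np.ndarray:
    # Embedding model name is an assumption.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)


def novelty(candidate: np.ndarray, previous: list[np.ndarray]) -> float:
    # Assumed definition: 1 - max cosine similarity to any previous answer.
    if not previous:
        return 1.0
    sims = [
        float(candidate @ p / (np.linalg.norm(candidate) * np.linalg.norm(p)))
        for p in previous
    ]
    return 1.0 - max(sims)


def score_question(question: str) -> int:
    answers: list[str] = []
    embeddings: list[np.ndarray] = []
    while True:
        answer = ask_model(question, previous=answers)
        cand_emb = embed(answer)
        if (
            judge_coherence(question, answer) <= COHERENCE_CUTOFF
            or novelty(cand_emb, embeddings) <= NOVELTY_CUTOFF
        ):
            break
        answers.append(answer)
        embeddings.append(cand_emb)
    # The per-question score is the number of valid (coherent, novel) answers.
    return len(answers)
```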
Quick Start & Requirements
git clone https://github.com/aidanmclaughlin/Aidan-Bench.git
cd Aidan-Bench
pip install numpy openai colorama retry
Set the OPENAI_API_KEY and OPEN_ROUTER_KEY environment variables, then run python benchmark/main.py.
To view results, cd visualize_results, run python -m http.server 8000, and open http://localhost:8000/visualization.
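If you prefer to launch the benchmark from Python, the minimal sketch below sets the two required keys and invokes the entry point listed above via subprocess; the key values are placeholders and no CLI flags are assumed.

```python
# Minimal sketch: set the required API keys and launch the benchmark script.
import os
import subprocess

os.environ["OPENAI_API_KEY"] = "sk-..."      # placeholder: your OpenAI key
os.environ["OPEN_ROUTER_KEY"] = "sk-or-..."  # placeholder: your OpenRouter key

# Entry point taken from the Quick Start above.
subprocess.run(["python", "benchmark/main.py"], check=True)
```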
Highlighted Details
Maintenance & Community
The project is associated with Aidan McLaughlin, James Campbell, Anuja Uppuluri, and Yiming Yang, with contributions noted as equal. It is slated for presentation at the NeurIPS 2024 Workshop on Language Gamification.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source integration.
Limitations & Caveats
The benchmark relies on external API keys (OpenAI, OpenRouter) and a judge model for scoring, introducing potential costs and dependencies. The effectiveness of the novelty score is dependent on the quality of the response embeddings.