AidanBench by aidanmclaughlin

LLM benchmark for evaluating creativity and open-ended question answering

created 1 year ago
307 stars

Top 88.4% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

AidanBench is an open-ended question benchmark designed to stress-test the creativity, reliability, and contextual attention of Large Language Models (LLMs). It targets researchers and developers evaluating LLMs beyond traditional accuracy metrics, offering a method to penalize mode collapse and inflexibility in generative AI.

How It Works

AidanBench evaluates LLMs by posing open-ended questions across diverse domains and prompting the model to generate novel answers while avoiding repetition. The benchmark continues generating responses until a response becomes incoherent (coherence score ≤ 15/100) or too semantically similar to previous outputs (novelty score ≤ 0.15). Scoring is based on the maximum number of valid responses generated per question, with coherence assessed by a judge model (o1-mini) and novelty calculated from response embeddings.
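
The loop below is a minimal, hypothetical sketch of that scoring procedure, not the repository's actual implementation. The prompt wording, helper names, embedding model, and model IDs are assumptions; only the 15/100 coherence and 0.15 novelty cutoffs come from the description above.

```python
# Minimal, hypothetical sketch of an AidanBench-style scoring loop (not the repo's code).
# Prompt wording, helper names, the embedding model, and model IDs are assumptions;
# the 15/100 coherence and 0.15 novelty cutoffs come from the benchmark description.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COHERENCE_CUTOFF = 15    # stop once the judge rates an answer <= 15/100
NOVELTY_CUTOFF = 0.15    # stop once an answer is too close to earlier answers

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def novelty(candidate: np.ndarray, previous: list[np.ndarray]) -> float:
    # Novelty here = 1 minus the highest cosine similarity to any earlier answer.
    if not previous:
        return 1.0
    sims = [float(candidate @ p / (np.linalg.norm(candidate) * np.linalg.norm(p)))
            for p in previous]
    return 1.0 - max(sims)

def score_question(question: str, answer_model: str, judge_model: str = "o1-mini") -> int:
    answers: list[str] = []
    embeddings: list[np.ndarray] = []
    score = 0
    while True:
        prompt = (f"{question}\n\nGive a novel answer that differs from these previous "
                  "answers:\n" + "\n".join(answers))
        answer = client.chat.completions.create(
            model=answer_model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

        judge_reply = client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content":
                       f"Rate the coherence of this answer to '{question}' "
                       f"from 0 to 100. Reply with a number only.\n\n{answer}"}],
        ).choices[0].message.content
        coherence = int(judge_reply.strip())  # assumes the judge complies with the format

        if coherence <= COHERENCE_CUTOFF:
            return score            # answer became incoherent
        emb = embed(answer)
        if novelty(emb, embeddings) <= NOVELTY_CUTOFF:
            return score            # answer too similar to earlier ones
        answers.append(answer)
        embeddings.append(emb)
        score += 1                  # count another valid, novel response
```

The repository's benchmark/main.py handles question sets, model selection, and aggregation; this sketch only mirrors the per-question termination logic described above.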

Quick Start & Requirements

  • Install: git clone https://github.com/aidanmclaughlin/Aidan-Bench.git, cd Aidan-Bench, pip install numpy openai colorama retry
  • Prerequisites: Python 3.x, OpenAI API key, OpenRouter API key.
  • Setup: Set OPENAI_API_KEY and OPEN_ROUTER_KEY environment variables.
  • Run: python benchmark/main.py (see the setup sketch after this list)
  • Visualization: cd visualize_results, python -m http.server 8000, then open http://localhost:8000/visualization.
  • Docs: https://github.com/aidanmclaughlin/Aidan-Bench
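
The snippet below is a hypothetical pre-flight check, not part of the repository: the environment-variable names and entry point come from the steps above, while the script itself is purely illustrative.

```python
# Hypothetical pre-flight check (not part of the repository): verifies the two API keys
# named above are set, then launches the benchmark entry point from the repo root.
import os
import subprocess
import sys

REQUIRED_KEYS = ["OPENAI_API_KEY", "OPEN_ROUTER_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")

subprocess.run([sys.executable, "benchmark/main.py"], check=True)
```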

Highlighted Details

  • Evaluates LLM creativity, reliability, contextual attention, and instruction following.
  • Penalizes mode collapse and inflexibility, with no score ceiling.
  • Utilizes a judge model (o1-mini) for coherence scoring and response embeddings for novelty.
  • Supports benchmarking multiple models and configuring test parameters like temperature and thresholds.
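
As an illustration of the last point, the configuration sketch below shows how multi-model runs with adjustable temperature and thresholds could be parameterized. BenchmarkConfig, run_model, the temperature default, and the OpenRouter-style model IDs are assumptions, not the project's actual API; only the two cutoff defaults mirror the thresholds described earlier.

```python
# Illustrative configuration sketch only; BenchmarkConfig, run_model, and the model IDs
# are hypothetical names, not the repository's actual API. The two cutoff defaults
# mirror the thresholds described above; the temperature default is an assumption.
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    temperature: float = 0.7      # sampling temperature for the answer model (assumed)
    coherence_cutoff: int = 15    # judge score (out of 100) below which generation stops
    novelty_cutoff: float = 0.15  # embedding-novelty score below which generation stops

def run_model(model_id: str, questions: list[str], config: BenchmarkConfig) -> dict[str, int]:
    # Placeholder: a real run would drive a scoring loop like the one sketched earlier.
    return {question: 0 for question in questions}

if __name__ == "__main__":
    config = BenchmarkConfig()
    questions = ["Name a use for a brick that nobody has thought of."]  # illustrative question
    for model_id in ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]:   # OpenRouter-style IDs
        per_question = run_model(model_id, questions, config)
        print(model_id, sum(per_question.values()))
```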

Maintenance & Community

The project is associated with Aidan McLaughlin, James Campbell, Anuja Uppuluri, and Yiming Yang, with contributions noted as equal. It is slated for presentation at the NeurIPS 2024 Workshop on Language Gamification.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The benchmark relies on external API keys (OpenAI, OpenRouter) and a judge model for scoring, which introduces costs and dependencies. The effectiveness of the novelty score depends on the quality of the response embeddings.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 90 days
