AidanBench by aidanmclaughlin

LLM benchmark for evaluating creativity and open-ended question answering

created 1 year ago
307 stars

Top 88.4% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

AidanBench is an open-ended question benchmark designed to stress-test the creativity, reliability, and contextual attention of Large Language Models (LLMs). It targets researchers and developers evaluating LLMs beyond traditional accuracy metrics, offering a method to penalize mode collapse and inflexibility in generative AI.

How It Works

AidanBench evaluates LLMs by posing open-ended questions across diverse domains and prompting the model to generate novel answers while avoiding repetition. The benchmark continues generating responses until a response becomes incoherent (coherence score ≤ 15/100) or too semantically similar to previous outputs (novelty score ≤ 0.15). Scoring is based on the maximum number of valid responses generated per question, with coherence assessed by a judge model (o1-mini) and novelty calculated from response embeddings.
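
The loop below is a minimal, hypothetical sketch of that scoring procedure, not the repository's actual implementation. The prompt wording, helper names, embedding model, and model IDs are assumptions; only the 15/100 coherence and 0.15 novelty cutoffs come from the description above.

```python
# Minimal, hypothetical sketch of an AidanBench-style scoring loop (not the repo's code).
# Prompt wording, helper names, the embedding model, and model IDs are assumptions;
# the 15/100 coherence and 0.15 novelty cutoffs come from the benchmark description.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COHERENCE_CUTOFF = 15    # stop once the judge rates an answer <= 15/100
NOVELTY_CUTOFF = 0.15    # stop once an answer is too close to earlier answers

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def novelty(candidate: np.ndarray, previous: list[np.ndarray]) -> float:
    # Novelty here = 1 minus the highest cosine similarity to any earlier answer.
    if not previous:
        return 1.0
    sims = [float(candidate @ p / (np.linalg.norm(candidate) * np.linalg.norm(p)))
            for p in previous]
    return 1.0 - max(sims)

def score_question(question: str, answer_model: str, judge_model: str = "o1-mini") -> int:
    answers: list[str] = []
    embeddings: list[np.ndarray] = []
    score = 0
    while True:
        prompt = (f"{question}\n\nGive a novel answer that differs from these previous "
                  "answers:\n" + "\n".join(answers))
        answer = client.chat.completions.create(
            model=answer_model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

        judge_reply = client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content":
                       f"Rate the coherence of this answer to '{question}' "
                       f"from 0 to 100. Reply with a number only.\n\n{answer}"}],
        ).choices[0].message.content
        coherence = int(judge_reply.strip())  # assumes the judge complies with the format

        if coherence <= COHERENCE_CUTOFF:
            return score            # answer became incoherent
        emb = embed(answer)
        if novelty(emb, embeddings) <= NOVELTY_CUTOFF:
            return score            # answer too similar to earlier ones
        answers.append(answer)
        embeddings.append(emb)
        score += 1                  # count another valid, novel response
```

The repository's benchmark/main.py handles question sets, model selection, and aggregation; this sketch only mirrors the per-question termination logic described above.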

Quick Start & Requirements

  • Install: git clone https://github.com/aidanmclaughlin/Aidan-Bench.git, cd Aidan-Bench, pip install numpy openai colorama retry
  • Prerequisites: Python 3.x, OpenAI API key, OpenRouter API key.
  • Setup: Set OPENAI_API_KEY and OPEN_ROUTER_KEY environment variables.
  • Run: python benchmark/main.py (see the setup sketch after this list)
  • Visualization: cd visualize_results, python -m http.server 8000, then open http://localhost:8000/visualization.
  • Docs: https://github.com/aidanmclaughlin/Aidan-Bench
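
The snippet below is a hypothetical pre-flight check, not part of the repository: the environment-variable names and entry point come from the steps above, while the script itself is purely illustrative.

```python
# Hypothetical pre-flight check (not part of the repository): verifies the two API keys
# named above are set, then launches the benchmark entry point from the repo root.
import os
import subprocess
import sys

REQUIRED_KEYS = ["OPENAI_API_KEY", "OPEN_ROUTER_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")

subprocess.run([sys.executable, "benchmark/main.py"], check=True)
```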

Highlighted Details

  • Evaluates LLM creativity, reliability, contextual attention, and instruction following.
  • Penalizes mode collapse and inflexibility, with no score ceiling.
  • Utilizes a judge model (o1-mini) for coherence scoring and response embeddings for novelty.
  • Supports benchmarking multiple models and configuring test parameters like temperature and thresholds.
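
As an illustration of the last point, the configuration sketch below shows how multi-model runs with adjustable temperature and thresholds could be parameterized. BenchmarkConfig, run_model, the temperature default, and the OpenRouter-style model IDs are assumptions, not the project's actual API; only the two cutoff defaults mirror the thresholds described earlier.

```python
# Illustrative configuration sketch only; BenchmarkConfig, run_model, and the model IDs
# are hypothetical names, not the repository's actual API. The two cutoff defaults
# mirror the thresholds described above; the temperature default is an assumption.
from dataclasses import dataclass

@dataclass
class BenchmarkConfig:
    temperature: float = 0.7      # sampling temperature for the answer model (assumed)
    coherence_cutoff: int = 15    # judge score (out of 100) below which generation stops
    novelty_cutoff: float = 0.15  # embedding-novelty score below which generation stops

def run_model(model_id: str, questions: list[str], config: BenchmarkConfig) -> dict[str, int]:
    # Placeholder: a real run would drive a scoring loop like the one sketched earlier.
    return {question: 0 for question in questions}

if __name__ == "__main__":
    config = BenchmarkConfig()
    questions = ["Name a use for a brick that nobody has thought of."]  # illustrative question
    for model_id in ["openai/gpt-4o", "anthropic/claude-3.5-sonnet"]:   # OpenRouter-style IDs
        per_question = run_model(model_id, questions, config)
        print(model_id, sum(per_question.values()))
```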

Maintenance & Community

The project is associated with Aidan McLaughlin, James Campbell, Anuja Uppuluri, and Yiming Yang, with contributions noted as equal. It is slated for presentation at the NeurIPS 2024 Workshop on Language Gamification.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial or closed-source integration.

Limitations & Caveats

The benchmark relies on external API keys (OpenAI, OpenRouter) and a judge model for scoring, which introduces costs and dependencies. The effectiveness of the novelty score depends on the quality of the response embeddings.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 11 stars in the last 90 days
