bullshit-benchmark by petergpt

AI benchmark for critical response evaluation

Created 1 week ago

439 stars

Top 68.3% on SourcePulse

View on GitHub
Project Summary

Summary

Bullshit Benchmark addresses the critical need to evaluate AI models' ability to identify and reject nonsensical or flawed prompts, rather than confidently generating plausible-sounding but incorrect information. Created for AI researchers and developers, it provides a standardized methodology and dataset to measure an AI's robustness against deceptive inputs, fostering more reliable and trustworthy AI systems.

How It Works

The benchmark employs a judge-model panel grading system. Responses to a curated set of nonsensical prompts (from questions.json) are collected from target AI models. These responses are then evaluated by a panel, categorizing them into "Clear Pushback," "Partial Challenge," or "Accepted Nonsense." The core functionality is managed by a Python CLI (scripts/openrouter_benchmark.py), orchestrating data collection, grading, aggregation, and reporting.
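The panel step described above can be sketched as a majority vote over per-judge labels. This is an illustrative assumption, not the repo's actual grading code: the function name `aggregate_panel_votes` and the tie-breaking rule (fall back to the most cautious category) are hypothetical; only the three category names come from the benchmark itself.

```python
from collections import Counter

# Grading categories used by the benchmark (from the project summary).
CATEGORIES = ("Clear Pushback", "Partial Challenge", "Accepted Nonsense")

def aggregate_panel_votes(votes):
    """Combine per-judge labels for one model response by majority vote.

    `votes` is a list of category strings, one per judge model.
    Ties fall back to the earliest label in CATEGORIES order
    (an assumption; the repo's real tie-breaking may differ).
    """
    counts = Counter(votes)
    best = max(counts.values())
    tied = [c for c in CATEGORIES if counts.get(c, 0) == best]
    return tied[0]
```

A run over many prompts would then tally these aggregated labels per model to produce the published leaderboard figures.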

Quick Start & Requirements

  • Primary Install/Run: Execute the full pipeline with ./scripts/run_end_to_end.sh.
  • Prerequisites: Requires an OPENROUTER_API_KEY environment variable. OPENROUTER_REFERER and OPENROUTER_APP_NAME are optional.
  • Links: Public Viewer: https://petergpt.github.io/bullshit-benchmark/viewer/index.html.
  • Setup: Running the end-to-end script initiates data collection, grading, and publishing. Local serving is available via ./scripts/run_end_to_end.sh --serve --port 8877.
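The steps above condense into a short command sequence. The key value and the optional referer/app-name values below are placeholders; only the script path, flags, and variable names come from the summary.

```shell
# Assumes the repo is cloned and you hold an OpenRouter API key.
export OPENROUTER_API_KEY="sk-or-..."            # required
export OPENROUTER_REFERER="https://example.com"  # optional
export OPENROUTER_APP_NAME="bullshit-benchmark"  # optional

# Run collection, grading, aggregation, and publishing:
./scripts/run_end_to_end.sh

# Or additionally serve the results viewer locally:
./scripts/run_end_to_end.sh --serve --port 8877
```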

Highlighted Details

  • Categorizes model responses into "Clear Pushback," "Partial Challenge," and "Accepted Nonsense."
  • Includes recent published snapshots featuring OpenAI's gpt-5.2-codex and gpt-5.3-codex variants across different reasoning levels.
  • Provides an interactive viewer (viewer/index.html) for exploring benchmark results and datasets.

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmap are present in the provided README snippet.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content. Verify the repository's licensing before commercial use or integration into closed-source projects.

Limitations & Caveats

The benchmark's effectiveness is contingent on the quality and comprehensiveness of the questions.json dataset and the subjective grading performed by the judge-model panel. Reliance on the OpenRouter API necessitates an active API key and may incur associated costs. The current configuration notes suggest a focus on specific OpenAI models, potentially limiting immediate applicability to other model providers without adaptation.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
6
Star History
449 stars in the last 7 days
