bullshit-benchmark by petergpt

AI benchmark for critical response evaluation

Created 1 week ago

439 stars

Top 68.3% on SourcePulse

View on GitHub
Project Summary

Summary

Bullshit Benchmark addresses the critical need to evaluate AI models' ability to identify and reject nonsensical or flawed prompts, rather than confidently generating plausible-sounding but incorrect information. Created for AI researchers and developers, it provides a standardized methodology and dataset to measure an AI's robustness against deceptive inputs, fostering more reliable and trustworthy AI systems.

How It Works

The benchmark employs a judge-model panel grading system. Responses to a curated set of nonsensical prompts (from questions.json) are collected from target AI models. These responses are then evaluated by a panel, categorizing them into "Clear Pushback," "Partial Challenge," or "Accepted Nonsense." The core functionality is managed by a Python CLI (scripts/openrouter_benchmark.py), orchestrating data collection, grading, aggregation, and reporting.
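The panel step described above can be sketched as a majority vote over per-judge labels. This is an illustrative assumption, not the repo's actual grading code: the function name `aggregate_panel_votes` and the tie-breaking rule (fall back to the most cautious category) are hypothetical; only the three category names come from the benchmark itself.

```python
from collections import Counter

# Grading categories used by the benchmark (from the project summary).
CATEGORIES = ("Clear Pushback", "Partial Challenge", "Accepted Nonsense")

def aggregate_panel_votes(votes):
    """Combine per-judge labels for one model response by majority vote.

    `votes` is a list of category strings, one per judge model.
    Ties fall back to the earliest label in CATEGORIES order
    (an assumption; the repo's real tie-breaking may differ).
    """
    counts = Counter(votes)
    best = max(counts.values())
    tied = [c for c in CATEGORIES if counts.get(c, 0) == best]
    return tied[0]
```

A run over many prompts would then tally these aggregated labels per model to produce the published leaderboard figures.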

Quick Start & Requirements

  • Primary Install/Run: Execute the full pipeline with ./scripts/run_end_to_end.sh.
  • Prerequisites: Requires an OPENROUTER_API_KEY environment variable. OPENROUTER_REFERER and OPENROUTER_APP_NAME are optional.
  • Links: Public Viewer: https://petergpt.github.io/bullshit-benchmark/viewer/index.html.
  • Setup: Running the end-to-end script initiates data collection, grading, and publishing. Local serving is available via ./scripts/run_end_to_end.sh --serve --port 8877.
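The steps above condense into a short command sequence. The key value and the optional referer/app-name values below are placeholders; only the script path, flags, and variable names come from the summary.

```shell
# Assumes the repo is cloned and you hold an OpenRouter API key.
export OPENROUTER_API_KEY="sk-or-..."            # required
export OPENROUTER_REFERER="https://example.com"  # optional
export OPENROUTER_APP_NAME="bullshit-benchmark"  # optional

# Run collection, grading, aggregation, and publishing:
./scripts/run_end_to_end.sh

# Or additionally serve the results viewer locally:
./scripts/run_end_to_end.sh --serve --port 8877
```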

Highlighted Details

  • Categorizes model responses into "Clear Pushback," "Partial Challenge," and "Accepted Nonsense."
  • Includes recent published snapshots featuring OpenAI's gpt-5.2-codex and gpt-5.3-codex variants across different reasoning levels.
  • Provides an interactive viewer (viewer/index.html) for exploring benchmark results and datasets.

Maintenance & Community

No specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmap are present in the provided README snippet.

Licensing & Compatibility

The license type is not explicitly stated in the provided README content. Verify the repository's licensing before commercial use or integration into closed-source projects.

Limitations & Caveats

The benchmark's effectiveness is contingent on the quality and comprehensiveness of the questions.json dataset and the subjective grading performed by the judge-model panel. Reliance on the OpenRouter API necessitates an active API key and may incur associated costs. The current configuration notes suggest a focus on specific OpenAI models, potentially limiting immediate applicability to other model providers without adaptation.

Health Check
Last Commit

1 day ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
6
Star History
449 stars in the last 7 days
