petergpt
AI benchmark for critical response evaluation
Summary
Bullshit Benchmark evaluates whether AI models identify and reject nonsensical or flawed prompts instead of confidently generating plausible-sounding but incorrect answers. Aimed at AI researchers and developers, it provides a standardized methodology and dataset for measuring a model's robustness against deceptive inputs.
How It Works
The benchmark grades responses with a panel of judge models. Target AI models answer a curated set of nonsensical prompts (from questions.json), and the panel assigns each response to one of three categories: "Clear Pushback," "Partial Challenge," or "Accepted Nonsense." A Python CLI (scripts/openrouter_benchmark.py) orchestrates data collection, grading, aggregation, and reporting.
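The summary does not say how the panel's individual grades are combined, so the following is only a minimal sketch of one plausible aggregation step. It assumes a majority vote per question, with ties resolved toward the less favorable category. The function names panel_verdict and summarize and the input layout are hypothetical and are not taken from scripts/openrouter_benchmark.py.

```python
from collections import Counter

# Category labels as named in the summary, ordered from best to worst outcome.
CATEGORIES = ("Clear Pushback", "Partial Challenge", "Accepted Nonsense")


def panel_verdict(judge_labels):
    """Return the majority label from one panel (hypothetical rule).

    Ties go to the later, less favorable category in CATEGORIES.
    """
    counts = Counter(judge_labels)
    if not counts:
        raise ValueError("panel returned no labels")
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    # Sort key is (vote count, position), so equal counts favor the later category.
    return max(CATEGORIES, key=lambda c: (counts[c], CATEGORIES.index(c)))


def summarize(grades_by_question):
    """Map {question_id: [judge labels]} to the share of each category."""
    if not grades_by_question:
        raise ValueError("no graded questions")
    verdicts = Counter(panel_verdict(v) for v in grades_by_question.values())
    total = sum(verdicts.values())
    return {c: verdicts[c] / total for c in CATEGORIES}


# Made-up grades for illustration only; not real benchmark results.
grades = {
    "q1": ["Clear Pushback", "Clear Pushback", "Partial Challenge"],
    "q2": ["Accepted Nonsense", "Partial Challenge", "Accepted Nonsense"],
    "q3": ["Clear Pushback", "Accepted Nonsense"],  # tie, resolved downward
    "q4": ["Partial Challenge", "Partial Challenge", "Clear Pushback"],
}
print(summarize(grades))
```

Resolving ties downward is a deliberately conservative choice for this sketch: a split panel is not counted as a clear rejection of the nonsense prompt.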
Quick Start & Requirements
Run: ./scripts/run_end_to_end.sh
Requires the OPENROUTER_API_KEY environment variable. OPENROUTER_REFERER and OPENROUTER_APP_NAME are optional.
Published viewer: https://petergpt.github.io/bullshit-benchmark/viewer/index.html
Local viewer: ./scripts/run_end_to_end.sh --serve --port 8877
Highlighted Details
Configuration notes for gpt-5.2-codex and gpt-5.3-codex variants across different reasoning levels.
Viewer (viewer/index.html) for exploring benchmark results and datasets.
Maintenance & Community
The README gives no details on maintainers, community channels (e.g., Discord, Slack), or a project roadmap.
Licensing & Compatibility
The README does not state a license. Confirm the licensing terms before commercial use or integration into closed-source projects.
Limitations & Caveats
The benchmark's usefulness depends on the quality and coverage of the questions.json dataset and on the subjective grading performed by the judge-model panel. Because it relies on the OpenRouter API, running it requires an active API key and may incur costs. The current configuration notes suggest a focus on specific OpenAI models, which may limit immediate applicability to other model providers without adaptation.