Benchmark for graduate-level, Google-proof question answering
This repository provides the GPQA benchmark, a dataset of graduate-level questions designed to be resistant to simple web searches. It includes baselines and analysis for evaluating large language models on challenging academic questions, targeting researchers and developers in NLP and AI.
How It Works
The project evaluates LLMs using various prompting strategies, including zero-shot, few-shot, chain-of-thought, and retrieval-augmented methods. Retrieval baselines leverage Bing search snippets and scraped web content to provide context, aiming to simulate more robust question-answering capabilities.
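As an illustration of the evaluation setup, here is a minimal sketch of a zero-shot multiple-choice loop. This is a hypothetical example, not the repository's actual code: the prompt wording, the `ask_model` callable, and the example dictionary keys are all assumptions.

```python
# Hypothetical zero-shot multiple-choice evaluation sketch
# (not the repository's actual implementation).

def format_zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Build a zero-shot prompt asking the model for a single letter answer."""
    letters = "ABCD"
    options = "\n".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return (
        "Answer the following multiple-choice question with a single letter.\n\n"
        f"Question: {question}\n{options}\nAnswer:"
    )

def evaluate(examples: list[dict], ask_model) -> float:
    """Score a model callable (prompt -> letter) against gold answer letters."""
    correct = 0
    for ex in examples:
        prompt = format_zero_shot_prompt(ex["question"], ex["choices"])
        pred = ask_model(prompt)
        correct += pred.strip().upper().startswith(ex["answer"])
    return correct / len(examples)
```

Few-shot and chain-of-thought variants would change only the prompt construction, while the retrieval baselines would prepend search snippets as context before the question.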
Quick Start & Requirements
Install dependencies:

pip install -r requirements.txt

Unzip the dataset (dataset.zip, password: deserted-untie-orchid), or use the Hugging Face version: https://huggingface.co/datasets/idavidrein/gpqa

Run a baseline:

python baselines/run_baseline.py main --model_name <model> --data_filename <path> --prompt_type <type>

See the README for full command examples.
Maintenance & Community
The project is authored by researchers at prominent institutions, giving it strong academic backing. The README does not describe community engagement channels or contribution guidelines.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.
Limitations & Caveats
Only two OpenAI models (GPT-3.5-turbo-16k-0613 and GPT-4) are currently implemented as baselines. The dataset archive is password-protected, which helps keep the questions out of web-crawled training data.