Benchmark for graduate-level, Google-proof question answering
This repository provides the GPQA benchmark, a dataset of graduate-level questions designed to be resistant to simple web searches. It includes baselines and analysis for evaluating large language models on challenging academic questions, targeting researchers and developers in NLP and AI.
How It Works
The project evaluates LLMs using various prompting strategies, including zero-shot, few-shot, chain-of-thought, and retrieval-augmented methods. Retrieval baselines leverage Bing search snippets and scraped web content to provide context, aiming to simulate more robust question-answering capabilities.
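As an illustration of the evaluation setup, here is a minimal sketch of a zero-shot multiple-choice loop. This is a hypothetical example, not the repository's actual code: the prompt wording, the `ask_model` callable, and the example dictionary keys are all assumptions.

```python
# Hypothetical zero-shot multiple-choice evaluation sketch
# (not the repository's actual implementation).

def format_zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Build a zero-shot prompt asking the model for a single letter answer."""
    letters = "ABCD"
    options = "\n".join(f"({letters[i]}) {c}" for i, c in enumerate(choices))
    return (
        "Answer the following multiple-choice question with a single letter.\n\n"
        f"Question: {question}\n{options}\nAnswer:"
    )

def evaluate(examples: list[dict], ask_model) -> float:
    """Score a model callable (prompt -> letter) against gold answer letters."""
    correct = 0
    for ex in examples:
        prompt = format_zero_shot_prompt(ex["question"], ex["choices"])
        pred = ask_model(prompt)
        correct += pred.strip().upper().startswith(ex["answer"])
    return correct / len(examples)
```

Few-shot and chain-of-thought variants would change only the prompt construction, while the retrieval baselines would prepend search snippets as context before the question.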
Quick Start & Requirements
Install dependencies:

pip install -r requirements.txt

Unzip the dataset (dataset.zip, password: deserted-untie-orchid), or use the Hugging Face version: https://huggingface.co/datasets/idavidrein/gpqa

Run a baseline:

python baselines/run_baseline.py main --model_name <model> --data_filename <path> --prompt_type <type>

See the README for full command examples.
Maintenance & Community
The project is authored by researchers at prominent institutions, giving it strong academic backing. The README does not describe community engagement channels or contribution guidelines.
Licensing & Compatibility
The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking would require clarification.
Limitations & Caveats
Only two OpenAI models (GPT-3.5-turbo-16k-0613 and GPT-4) are currently implemented as baselines. The dataset archive is password-protected, which helps keep the questions out of web-crawled training data.