gpqa  by idavidrein

Benchmark for graduate-level, Google-proof question answering

created 2 years ago
377 stars

Top 76.5% on sourcepulse

Project Summary

This repository provides the GPQA benchmark, a dataset of graduate-level questions designed to be resistant to simple web searches. It includes baselines and analysis for evaluating large language models on challenging academic questions, targeting researchers and developers in NLP and AI.

How It Works

The project evaluates LLMs under several prompting strategies: zero-shot, few-shot, chain-of-thought, and retrieval-augmented. The retrieval (open-book) baselines add Bing search snippets and scraped web content to the prompt as context, so the model can draw on external evidence when answering.
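As a rough illustration of the closed-book, zero-shot setting described above, the sketch below shuffles the answer options of one multiple-choice question, formats a prompt, and scores letter predictions. The names Example, make_example, zero_shot_prompt, and accuracy, as well as the prompt wording, are hypothetical and are not taken from the repository's code.

```python
import random
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class Example:
    question: str
    choices: list[str]   # answer options in presentation order
    correct_index: int   # position of the correct option in `choices`


def make_example(question, correct, incorrect, rng):
    """Shuffle the options so the correct answer has no fixed position."""
    choices = [correct, *incorrect]
    rng.shuffle(choices)
    return Example(question, choices, choices.index(correct))


def zero_shot_prompt(ex):
    """Format one question with lettered options (assumed wording)."""
    lines = [f"Question: {ex.question}", "Choices:"]
    lines += [f"({LETTERS[i]}) {c}" for i, c in enumerate(ex.choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(examples, predicted_letters):
    """Fraction of examples whose predicted letter matches the key."""
    hits = sum(
        LETTERS[ex.correct_index] == pred.strip().upper()
        for ex, pred in zip(examples, predicted_letters)
    )
    return hits / len(examples)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed keeps the shuffle reproducible
    ex = make_example(
        "Placeholder question text?",
        "right answer",
        ["wrong 1", "wrong 2", "wrong 3"],
        rng,
    )
    print(zero_shot_prompt(ex))
```

A few-shot or chain-of-thought variant would prepend worked examples or a reasoning instruction to the same prompt; a retrieval variant would prepend the retrieved snippets.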

Quick Start & Requirements

  • Install: Create a Python 3.9 virtual environment and run pip install -r requirements.txt.
  • Prerequisites: OpenAI API key (required), Bing Search V7 Subscription Key (for open-book baselines).
  • Dataset: Download dataset.zip (password: deserted-untie-orchid) or use the Hugging Face version: https://huggingface.co/datasets/idavidrein/gpqa.
  • Usage: Run evaluations with python baselines/run_baseline.py main --model_name <model> --data_filename <path> --prompt_type <type>. See README for full command examples.
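The usage line above can be assembled programmatically when sweeping over several settings. The sketch below only builds and prints the commands; it does not run them. The script path and the three flags come from the usage line, while the helper name build_baseline_command, the model string, the data path, and the prompt type values are placeholders that should be checked against the README.

```python
import shlex


def build_baseline_command(model_name, data_filename, prompt_type):
    """Return the argv list for one baseline run (flags as in the usage line)."""
    return [
        "python",
        "baselines/run_baseline.py",
        "main",
        "--model_name", model_name,
        "--data_filename", data_filename,
        "--prompt_type", prompt_type,
    ]


if __name__ == "__main__":
    # All three values below are assumed placeholders, not confirmed names.
    for prompt_type in ["zero_shot", "chain_of_thought"]:
        cmd = build_baseline_command("gpt-4", "dataset/gpqa_main.csv", prompt_type)
        print(shlex.join(cmd))
        # To execute instead of print: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.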

Highlighted Details

  • Supports GPT-3.5-turbo-16k-0613 and GPT-4 models.
  • Includes retrieval methods that incorporate Bing search snippets and scraped web content.
  • Provides scripts for training answer-only baselines (T5-based span-identification and CBOW).
  • Dataset contains a unique canary string for detection.
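A canary string is typically used to check whether benchmark data has leaked into a text corpus. The sketch below shows one simple way such a check might look, using plain substring search. The function names and the canary value are hypothetical; the real canary string should be taken from the dataset itself.

```python
def contains_canary(text, canary):
    """Return True if the canary string occurs anywhere in the text."""
    if not canary:
        raise ValueError("canary must be a non-empty string")
    return canary in text


def scan_documents(documents, canary):
    """Return the indices of documents that contain the canary string."""
    return [i for i, doc in enumerate(documents) if contains_canary(doc, canary)]


if __name__ == "__main__":
    # Placeholder value for illustration only, not the dataset's canary.
    canary = "EXAMPLE-CANARY-0000"
    corpus = [
        "an ordinary web page",
        f"a leaked benchmark file {canary} with questions",
    ]
    print(scan_documents(corpus, canary))
```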

Maintenance & Community

The project's authors are affiliated with prominent academic institutions. The README gives no further details on community engagement or maintenance.

Licensing & Compatibility

The README does not state the repository's license. Confirm the license terms before commercial use or closed-source linking.

Limitations & Caveats

Only two OpenAI models, GPT-3.5-turbo-16k-0613 and GPT-4, are currently implemented. The dataset.zip download is password-protected.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 46 stars in the last 90 days
