TruthfulQA  by sylinrl

Benchmark dataset for evaluating truthfulness of language models

Created 4 years ago
809 stars

Top 43.7% on SourcePulse

GitHubView on GitHub
Project Summary

This repository provides the TruthfulQA benchmark for measuring how well language models imitate human falsehoods. It offers a dataset of questions and reference answers, along with evaluation scripts for both generation and multiple-choice tasks, targeting researchers and developers evaluating LLM truthfulness.

How It Works

TruthfulQA evaluates models on their tendency to generate truthful answers, even when common misconceptions exist. It offers two primary tasks: a generation task where models produce short answers and a multiple-choice task where models select the correct answer from options. Evaluation metrics include GPT-3 based judges (GPT-judge, GPT-info) for truthfulness and informativeness, and similarity metrics like BLEURT, ROUGE, and BLEU, comparing model outputs to true and false references.

Quick Start & Requirements

  • Install via pip: pip install -r requirements.txt and pip install -e .
  • Requires PyTorch with CUDA for GPU acceleration.
  • Supports models like GPT-3, GPT-Neo/J, GPT-2, and UnifiedQA.
  • For GPT-3 metrics, OpenAI API access and fine-tuned models are needed.
  • A Colab notebook is available for easier GPU-based execution.
  • Official Docs

Highlighted Details

  • Offers a new, recommended multiple-choice setting with binary options.
  • Includes GPT-3 based metrics (GPT-judge, GPT-info) with high validation accuracy for human judgment prediction.
  • Provides fine-tuning datasets for GPT-3 evaluation metrics.
  • Updated dataset with more reference answers and removed the "Indexical Error: Time" category.

Maintenance & Community

  • Authors: Stephanie Lin (University of Oxford), Jacob Hilton (OpenAI), Owain Evans (University of Oxford).
  • Last updated January 2025 with a new multiple-choice setting and dataset fixes.

Licensing & Compatibility

  • The repository itself appears to be under a permissive license, but the dataset's licensing is not explicitly stated in the README.
  • Commercial use compatibility would depend on the dataset license.

Limitations & Caveats

  • The README mentions that older versions of the dataset may contain outdated information due to question updates.
  • Full validation of the remaining questions after updates is noted as incomplete.
  • GPT-3 metrics require OpenAI API access and fine-tuning, which may not be universally available.
Health Check
Last Commit

8 months ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
16 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems") and Travis Fischer Travis Fischer(Founder of Agentic).

long-form-factuality by google-deepmind

0.2%
640
Benchmark for long-form factuality in LLMs
Created 1 year ago
Updated 1 month ago
Feedback? Help us improve.