This repository provides the TruthfulQA benchmark for measuring how well language models imitate human falsehoods. It offers a dataset of questions and reference answers, along with evaluation scripts for both generation and multiple-choice tasks, targeting researchers and developers evaluating LLM truthfulness.
How It Works
TruthfulQA evaluates models on their tendency to generate truthful answers, even when common misconceptions exist. It offers two primary tasks: a generation task where models produce short answers and a multiple-choice task where models select the correct answer from options. Evaluation metrics include GPT-3 based judges (GPT-judge, GPT-info) for truthfulness and informativeness, and similarity metrics like BLEURT, ROUGE, and BLEU, comparing model outputs to true and false references.
Quick Start & Requirements
- Install via pip:
pip install -r requirements.txt
and pip install -e .
- Requires PyTorch with CUDA for GPU acceleration.
- Supports models like GPT-3, GPT-Neo/J, GPT-2, and UnifiedQA.
- For GPT-3 metrics, OpenAI API access and fine-tuned models are needed.
- A Colab notebook is available for easier GPU-based execution.
- Official Docs
Highlighted Details
- Offers a new, recommended multiple-choice setting with binary options.
- Includes GPT-3 based metrics (GPT-judge, GPT-info) with high validation accuracy for human judgment prediction.
- Provides fine-tuning datasets for GPT-3 evaluation metrics.
- Updated dataset with more reference answers and removed the "Indexical Error: Time" category.
Maintenance & Community
- Authors: Stephanie Lin (University of Oxford), Jacob Hilton (OpenAI), Owain Evans (University of Oxford).
- Last updated January 2025 with a new multiple-choice setting and dataset fixes.
Licensing & Compatibility
- The repository itself appears to be under a permissive license, but the dataset's licensing is not explicitly stated in the README.
- Commercial use compatibility would depend on the dataset license.
Limitations & Caveats
- The README mentions that older versions of the dataset may contain outdated information due to question updates.
- Full validation of the remaining questions after updates is noted as incomplete.
- GPT-3 metrics require OpenAI API access and fine-tuning, which may not be universally available.