TruthfulQA by sylinrl

Benchmark dataset for evaluating truthfulness of language models

Created 4 years ago

864 stars

Top 41.6% on SourcePulse

4 Experts Love This Project

winglian

Founder of Axolotl AI

hammer

Jeff Hammerbacher

Cofounder of Cloudera

evhub

Head of Alignment Stress-Testing at Anthropic

collin-burns

Author of MMLU; MTS at Anthropic

Project Summary

This repository provides the TruthfulQA benchmark for measuring how well language models imitate human falsehoods. It offers a dataset of questions and reference answers, along with evaluation scripts for both generation and multiple-choice tasks, targeting researchers and developers evaluating LLM truthfulness.

How It Works

TruthfulQA evaluates models on their tendency to generate truthful answers, even when common misconceptions exist. It offers two primary tasks: a generation task where models produce short answers and a multiple-choice task where models select the correct answer from options. Evaluation metrics include GPT-3 based judges (GPT-judge, GPT-info) for truthfulness and informativeness, and similarity metrics like BLEURT, ROUGE, and BLEU, comparing model outputs to true and false references.

Quick Start & Requirements

Install via pip: pip install -r requirements.txt and pip install -e .
Requires PyTorch with CUDA for GPU acceleration.
Supports models like GPT-3, GPT-Neo/J, GPT-2, and UnifiedQA.
For GPT-3 metrics, OpenAI API access and fine-tuned models are needed.
A Colab notebook is available for easier GPU-based execution.
Official Docs

Highlighted Details

Offers a new, recommended multiple-choice setting with binary options.
Includes GPT-3 based metrics (GPT-judge, GPT-info) with high validation accuracy for human judgment prediction.
Provides fine-tuning datasets for GPT-3 evaluation metrics.
Updated dataset with more reference answers and removed the "Indexical Error: Time" category.

Maintenance & Community

Authors: Stephanie Lin (University of Oxford), Jacob Hilton (OpenAI), Owain Evans (University of Oxford).
Last updated January 2025 with a new multiple-choice setting and dataset fixes.

Licensing & Compatibility

The repository itself appears to be under a permissive license, but the dataset's licensing is not explicitly stated in the README.
Commercial use compatibility would depend on the dataset license.

Limitations & Caveats

The README mentions that older versions of the dataset may contain outdated information due to question updates.
Full validation of the remaining questions after updates is noted as incomplete.
GPT-3 metrics require OpenAI API access and fine-tuning, which may not be universally available.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)

1

Issues (30d)

0

Star History

12 stars in the last 30 days

Explore Similar Projects

llm_benchmark by llm2014

LLM evaluation benchmark tracking model evolution

Created 11 months ago

Updated 1 day ago

LEval by OpenLMLab

Benchmark for long-context language model evaluation

Created 2 years ago

Updated 1 year ago

LLM-Factuality-Survey by wangcunxiang

Survey paper on factuality in large language models

Created 2 years ago

Updated 1 year ago

Starred by

Ying Sheng

Ying Sheng(Coauthor of SGLang),

Casper Hansen

Casper Hansen(Author of AutoAWQ), and

1 more.

InfiniteBench by OpenBMB

Benchmark for evaluating language models on super-long contexts (100k+ tokens)

Created 2 years ago

Updated 1 year ago

RGB by chen700564

Benchmark for LLM evaluation in Retrieval-Augmented Generation

Created 2 years ago

Updated 1 year ago

geval by nlpyang

GPT-4-based evaluation code for NLG, per research paper

Created 2 years ago

Updated 1 year ago

Qwen2-Boundless by ystemsrx

Fine-tuned language model for handling sensitive topics

Created 1 year ago

Updated 1 year ago

Starred by

Yaowei Zheng

Yaowei Zheng(Author of LLaMA-Factory).

GAOKAO-Bench by OpenLMLab

Evaluation framework for assessing LLMs using Chinese GAOKAO (college entrance exam) questions

Created 2 years ago

Updated 1 year ago

Starred by

Elvis Saravia

Elvis Saravia(Founder of DAIR.AI).

TrustLLM by HowieHwong

Trustworthiness benchmark for large language models (ICML 2024)

Created 2 years ago

Updated 6 months ago

Starred by

Chip Huyen

Chip Huyen(Author of "AI Engineering", "Designing Machine Learning Systems") and

Travis Fischer

Travis Fischer(Founder of Agentic).

long-form-factuality by google-deepmind

Benchmark for long-form factuality in LLMs

Created 1 year ago

Updated 3 days ago

Starred by

Zack Li

Zack Li(Cofounder of Nexa AI),

Simon Willison

Simon Willison(Coauthor of Django), and

6 more.

test by hendrycks

Research paper for measuring multitask language understanding

Created 5 years ago

Updated 2 years ago

Starred by

Yaowei Zheng

Yaowei Zheng(Author of LLaMA-Factory),

Pawel Garbacki

Pawel Garbacki(Cofounder of Fireworks AI), and

9 more.

arena-hard-auto by lmarena

Automatic LLM benchmark for instruction-tuned models, correlating with human preference

Created 2 years ago

Updated 6 months ago

Feedback? Help us improve.