Benchmark dataset for evaluating truthfulness of language models
This repository provides the TruthfulQA benchmark for measuring whether language models mimic human falsehoods. It includes a dataset of questions with true and false reference answers, along with evaluation scripts for both a generation task and a multiple-choice task, and is aimed at researchers and developers assessing the truthfulness of LLMs.
How It Works
TruthfulQA evaluates a model's tendency to give truthful answers to questions on which common human misconceptions exist. It defines two primary tasks: a generation task, where models produce short free-form answers, and a multiple-choice task, where models select an answer from a fixed set of options. Evaluation relies on fine-tuned GPT-3 judges (GPT-judge for truthfulness, GPT-info for informativeness) as well as similarity metrics such as BLEURT, ROUGE, and BLEU, which compare model outputs against the sets of true and false reference answers.
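For the multiple-choice setting, the paper's MC1 metric asks whether the model assigns its highest score to a truthful option. The sketch below only illustrates that idea; the score_answer callable is a hypothetical stand-in for a model log-likelihood scorer and is not part of the repository's API.

# Minimal MC1-style sketch. Assumes a hypothetical score_answer(question, answer)
# that returns the model's log-likelihood of the answer given the question
# (e.g., summed token log-probabilities from an autoregressive LM).
from typing import Callable, List

def mc1_correct(
    question: str,
    correct_answers: List[str],
    incorrect_answers: List[str],
    score_answer: Callable[[str, str], float],
) -> bool:
    """Return True if the model's single top-scored choice is a truthful answer."""
    scored = [(score_answer(question, a), True) for a in correct_answers]
    scored += [(score_answer(question, a), False) for a in incorrect_answers]
    # MC1: the highest-likelihood choice must come from the set of true answers.
    _, is_true = max(scored, key=lambda pair: pair[0])
    return is_true

MC1 accuracy over the benchmark is then the fraction of questions for which the top-scored choice is truthful.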
Quick Start & Requirements
From the repository root, install the dependencies and then the package itself:
pip install -r requirements.txt
pip install -e .
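After installation, a quick way to sanity-check the setup is to load the question file directly. A minimal sketch, assuming the benchmark CSV is named TruthfulQA.csv and uses column names such as Question, Best Answer, Correct Answers, and Incorrect Answers; check the header of the file you downloaded.

import pandas as pd

# Load the benchmark questions and inspect the first entry.
df = pd.read_csv("TruthfulQA.csv")
print(df.shape)                # (number of questions, number of columns)
print(df.columns.tolist())     # confirm the actual column names
print(df.iloc[0]["Question"])  # one example question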
Highlighted Details
Maintenance & Community
The repository's last activity was about 8 months ago, and the project is currently marked inactive.
Licensing & Compatibility
Limitations & Caveats