sylinrl/TruthfulQA: Benchmark dataset for evaluating truthfulness of language models
Top 41.6% on SourcePulse
This repository provides the TruthfulQA benchmark for measuring whether language models mimic human falsehoods when answering questions. It offers a dataset of questions and reference answers, along with evaluation scripts for both generation and multiple-choice tasks, and is aimed at researchers and developers evaluating LLM truthfulness.
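For a quick look at the data itself, the benchmark is also mirrored on the Hugging Face Hub as truthful_qa. The snippet below is a minimal sketch of browsing that mirror with the datasets library; the repository itself distributes the data as CSV files, so this is a convenient alternative access path rather than part of the repo's own tooling.

```python
# Minimal sketch: browse the TruthfulQA data via the Hugging Face Hub mirror.
# Assumes the `datasets` library is installed (pip install datasets).
from datasets import load_dataset

# The benchmark has 817 questions, exposed as a single "validation" split.
gen = load_dataset("truthful_qa", "generation", split="validation")
mc = load_dataset("truthful_qa", "multiple_choice", split="validation")

example = gen[0]
print(example["question"])           # the question text
print(example["best_answer"])        # single best reference answer
print(example["correct_answers"])    # list of acceptable true answers
print(example["incorrect_answers"])  # list of common false answers
```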
How It Works
TruthfulQA evaluates whether models generate truthful answers to questions that many humans answer falsely because of common misconceptions. It offers two primary tasks: a generation task, in which models produce short free-form answers, and a multiple-choice task, in which models select the correct answer from a set of options. Evaluation metrics include fine-tuned GPT-3 judges (GPT-judge and GPT-info) for truthfulness and informativeness, plus similarity metrics such as BLEURT, ROUGE, and BLEU that compare model outputs against true and false reference answers.
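The multiple-choice task is typically scored by comparing the log-likelihood the model assigns to each candidate answer. The sketch below illustrates that idea with a small causal LM from Hugging Face transformers; the prompt format and helper names are illustrative assumptions, not the repository's own evaluation code.

```python
# Minimal sketch of MC1-style scoring: an item counts as correct when the
# true answer receives the highest total log-probability among the choices.
# Assumes `transformers` and `torch` are installed; "gpt2" stands in for the
# model under evaluation, and the prompt format is a simplification.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_logprob(question: str, answer: str) -> float:
    """Total log-probability of the answer tokens conditioned on the question."""
    prompt_ids = tokenizer(f"Q: {question}\nA:", return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Row i of log_probs predicts the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    answer_positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    answer_tokens = input_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[pos, tok].item()
               for pos, tok in zip(answer_positions, answer_tokens))

def mc1_correct(question: str, choices: list[str], correct_idx: int) -> bool:
    """True if the correct choice gets the highest score among all choices."""
    scores = [answer_logprob(question, c) for c in choices]
    return max(range(len(scores)), key=scores.__getitem__) == correct_idx
```

With the real data, the choices and the index of the correct answer would come from each item's mc1_targets field in the multiple-choice split.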
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt, then install the package in editable mode with pip install -e .
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats