Benchmark dataset for LLM hallucination evaluation
HaluEval provides a large-scale benchmark for evaluating the hallucination detection capabilities of Large Language Models (LLMs). It targets researchers and developers working on LLM safety and reliability, offering a comprehensive dataset and tools to assess how well models identify fabricated information across various tasks.
How It Works
HaluEval constructs a 35K-sample dataset by using ChatGPT to generate hallucinated responses. It combines 5,000 general user queries (sampled from Alpaca, with ChatGPT responses human-annotated for hallucination) with 30,000 task-specific examples covering question answering (QA), knowledge-grounded dialogue, and text summarization. For the task-specific data, it uses existing datasets (HotpotQA, OpenDialKG, CNN/Daily Mail) as seeds and prompts ChatGPT with task-specific instructions under two generation schemes, one-pass and conversational, to produce hallucinated samples. A filtering step, guided by the ground-truth examples, then selects the most plausible and challenging sample for evaluation.
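The sample-then-filter construction can be sketched as follows; this is a minimal illustration, and the function names, scheme labels, and record fields below are assumptions, not the repository's actual API:

```python
def build_halueval_samples(seed_examples, generate_with_chatgpt, plausibility_score):
    """Illustrative sketch of HaluEval's sampling-then-filtering pipeline.

    seed_examples: task-specific seeds (e.g., HotpotQA / OpenDialKG / CNN-DM items),
        each assumed to carry its ground-truth output under "ground_truth"
    generate_with_chatgpt: callable that asks ChatGPT for a hallucinated response
        under a given generation scheme ("one-pass" or "conversational")
    plausibility_score: callable that rates how plausible and challenging a
        hallucinated candidate is, given the ground-truth reference
    """
    dataset = []
    for seed in seed_examples:
        # Step 1 (sampling): generate hallucinated candidates with both schemes.
        candidates = [
            generate_with_chatgpt(seed, scheme="one-pass"),
            generate_with_chatgpt(seed, scheme="conversational"),
        ]
        # Step 2 (filtering): keep the candidate that is hardest to tell apart
        # from the ground truth, so the benchmark stays challenging.
        best = max(candidates, key=lambda c: plausibility_score(c, seed["ground_truth"]))
        dataset.append({"seed": seed, "hallucinated_sample": best})
    return dataset
```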
Quick Start & Requirements
The code is organized into the generation and evaluation directories. To run an evaluation, specify the task (e.g., qa) and the model to be evaluated (e.g., gpt-3.5-turbo).
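As a minimal sketch of what such an evaluation check involves (the prompt wording and sample fields here are illustrative assumptions; the repository's own scripts in the evaluation directory may differ), a single hallucination-detection query against an OpenAI model could look like:

```python
# Ask a judge model whether a given answer is hallucinated.
# Requires an OpenAI API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Hypothetical QA sample in the spirit of the HotpotQA-seeded data;
# the field names "question"/"answer" are illustrative.
sample = {
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "answer": "First for Women was started first.",  # a hallucinated answer
}

judgement = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[{
        "role": "user",
        "content": (
            "You are judging whether an answer contains hallucinated "
            "(fabricated or unsupported) information.\n"
            f"Question: {sample['question']}\n"
            f"Answer: {sample['answer']}\n"
            'Reply with "Yes" if the answer is hallucinated, otherwise "No".'
        ),
    }],
)
print(judgement.choices[0].message.content)
```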
Maintenance & Community
The project is associated with the paper "HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models"; its authors are affiliated with Renmin University of China and Université de Montréal. The repository was last updated roughly a year ago and is marked as inactive. No specific community channels or roadmap are mentioned in the README.
Licensing & Compatibility
The repository is released under the MIT License, permitting commercial use and modification.
Limitations & Caveats
The data generation process relies heavily on ChatGPT, which may introduce its own biases and blind spots. The benchmark's effectiveness therefore depends on the quality of ChatGPT's generated hallucinations and on the human annotation of the general user queries.