HaluEval by RUCAIBox

Benchmark dataset for LLM hallucination evaluation

created 2 years ago
497 stars

Top 63.3% on sourcepulse

View on GitHub
Project Summary

HaluEval provides a large-scale benchmark for evaluating the hallucination detection capabilities of Large Language Models (LLMs). It targets researchers and developers working on LLM safety and reliability, offering a comprehensive dataset and tools to assess how well models identify fabricated information across various tasks.

How It Works

HaluEval builds a 35K-sample dataset of hallucinated responses generated with ChatGPT. It combines 5,000 general user queries (sampled from Alpaca, with ChatGPT responses human-annotated for hallucination) and 30,000 task-specific examples spanning question answering (QA), knowledge-grounded dialogue, and text summarization. For the task-specific data, existing datasets (HotpotQA, OpenDialKG, CNN/Daily Mail) serve as seeds, and ChatGPT is instructed to produce hallucinated samples via both one-pass and conversational generation. A filtering step, guided by ground-truth examples, keeps the most plausible and challenging samples for evaluation.
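The evaluation protocol can be illustrated with a minimal sketch: each benchmark record pairs a correct answer with a hallucinated one, and the evaluated model is shown one of the two and asked to judge whether it contains hallucination. The field names below (`question`, `right_answer`, `hallucinated_answer`) are assumptions for illustration, not necessarily the repo's exact schema.

```python
import random

def make_eval_instance(record, rng=random):
    """Pick either the ground-truth or the hallucinated answer, and
    record which one was shown so the judge's Yes/No can be scored."""
    is_hallucinated = rng.random() < 0.5
    answer = record["hallucinated_answer"] if is_hallucinated else record["right_answer"]
    return {
        "question": record["question"],
        "answer": answer,
        "label": "Yes" if is_hallucinated else "No",  # "Yes" = contains hallucination
    }

# Hypothetical record; real HaluEval data is drawn from seed datasets like HotpotQA.
record = {
    "question": "What is the capital of France?",
    "right_answer": "Paris",
    "hallucinated_answer": "The capital of France is Lyon.",
}
instance = make_eval_instance(record)
```

The judge model then sees `instance["question"]` and `instance["answer"]` and must output "Yes" or "No", which is compared against `instance["label"]`.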

Quick Start & Requirements

  • Install dependencies, then run the scripts in the generation and evaluation directories.
  • Requires Python and access to the seed datasets (HotpotQA, OpenDialKG, CNN/Daily Mail); download links are provided.
  • Evaluation requires specifying the task (e.g., qa) and the model to evaluate (e.g., gpt-3.5-turbo).
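Scoring a model on the benchmark boils down to comparing its Yes/No judgments against the ground-truth labels. A minimal sketch of that accuracy computation (the function name and inputs are hypothetical, not the repo's actual interface):

```python
def detection_accuracy(judgments, labels):
    """Fraction of Yes/No hallucination judgments matching the labels."""
    if not judgments or len(judgments) != len(labels):
        raise ValueError("judgments and labels must be equal-length and non-empty")
    correct = sum(j.strip().lower() == l.strip().lower()
                  for j, l in zip(judgments, labels))
    return correct / len(labels)

# A model judged four samples and matched the label on three of them:
acc = detection_accuracy(["Yes", "No", "No", "Yes"],
                         ["Yes", "No", "Yes", "Yes"])
# acc == 0.75
```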

Highlighted Details

  • 35K hallucinated samples, combining ChatGPT generation with human annotation.
  • Covers QA, dialogue, and summarization tasks.
  • Includes code for data generation, model evaluation, and analysis (e.g., using LDA).
  • General user queries are human-annotated for hallucination labels.

Maintenance & Community

The project is associated with the paper "HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models" and lists authors from Renmin University of China and the Université de Montréal. No specific community channels or roadmap are mentioned in the README.

Licensing & Compatibility

The repository is released under the MIT License, permitting commercial use and modification.

Limitations & Caveats

The data generation process relies heavily on ChatGPT, which may introduce its own biases or limitations. The effectiveness of the benchmark is dependent on the quality of ChatGPT's generated hallucinations and the human annotation for general queries.

Health Check
Last commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
31 stars in the last 90 days

Explore Similar Projects

Starred by Jerry Liu (Cofounder of LlamaIndex), Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), and 1 more.

hallucination-leaderboard by vectara

Top 0.9% on sourcepulse
3k stars
LLM leaderboard for hallucination detection in summarization
created 1 year ago
updated 1 day ago