Benchmark dataset for LLM hallucination evaluation
HaluEval provides a large-scale benchmark for evaluating the hallucination detection capabilities of Large Language Models (LLMs). It targets researchers and developers working on LLM safety and reliability, offering a comprehensive dataset and tools to assess how well models identify fabricated information across various tasks.
How It Works
HaluEval constructs a 35K-sample dataset by using ChatGPT to generate hallucinated responses. It combines 5,000 general user queries (sampled from Alpaca, with ChatGPT responses human-annotated for hallucination) with 30,000 task-specific examples covering question answering (QA), knowledge-grounded dialogue, and text summarization. For the task-specific data, it uses existing datasets (HotpotQA, OpenDialKG, CNN/Daily Mail) as seeds and prompts ChatGPT with task-specific instructions under two generation schemes, one-pass and conversational, to produce hallucinated samples. A filtering step, guided by the ground-truth examples, then selects the most plausible and challenging sample for evaluation.
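The sample-then-filter construction can be sketched as follows; this is a minimal illustration, and the function names, scheme labels, and record fields below are assumptions, not the repository's actual API:

```python
def build_halueval_samples(seed_examples, generate_with_chatgpt, plausibility_score):
    """Illustrative sketch of HaluEval's sampling-then-filtering pipeline.

    seed_examples: task-specific seeds (e.g., HotpotQA / OpenDialKG / CNN-DM items),
        each assumed to carry its ground-truth output under "ground_truth"
    generate_with_chatgpt: callable that asks ChatGPT for a hallucinated response
        under a given generation scheme ("one-pass" or "conversational")
    plausibility_score: callable that rates how plausible and challenging a
        hallucinated candidate is, given the ground-truth reference
    """
    dataset = []
    for seed in seed_examples:
        # Step 1 (sampling): generate hallucinated candidates with both schemes.
        candidates = [
            generate_with_chatgpt(seed, scheme="one-pass"),
            generate_with_chatgpt(seed, scheme="conversational"),
        ]
        # Step 2 (filtering): keep the candidate that is hardest to tell apart
        # from the ground truth, so the benchmark stays challenging.
        best = max(candidates, key=lambda c: plausibility_score(c, seed["ground_truth"]))
        dataset.append({"seed": seed, "hallucinated_sample": best})
    return dataset
```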
Quick Start & Requirements
The code is organized into the generation and evaluation directories. To run an evaluation, specify the task (e.g., qa) and the model to be evaluated (e.g., gpt-3.5-turbo).
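As a minimal sketch of what such an evaluation check involves (the prompt wording and sample fields here are illustrative assumptions; the repository's own scripts in the evaluation directory may differ), a single hallucination-detection query against an OpenAI model could look like:

```python
# Ask a judge model whether a given answer is hallucinated.
# Requires an OpenAI API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Hypothetical QA sample in the spirit of the HotpotQA-seeded data;
# the field names "question"/"answer" are illustrative.
sample = {
    "question": "Which magazine was started first, Arthur's Magazine or First for Women?",
    "answer": "First for Women was started first.",  # a hallucinated answer
}

judgement = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[{
        "role": "user",
        "content": (
            "You are judging whether an answer contains hallucinated "
            "(fabricated or unsupported) information.\n"
            f"Question: {sample['question']}\n"
            f"Answer: {sample['answer']}\n"
            'Reply with "Yes" if the answer is hallucinated, otherwise "No".'
        ),
    }],
)
print(judgement.choices[0].message.content)
```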
Maintenance & Community
The project is associated with the paper "HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models"; its authors are affiliated with Renmin University of China and Université de Montréal. The repository was last updated roughly a year ago and is marked as inactive. No specific community channels or roadmap are mentioned in the README.
Licensing & Compatibility
The repository is released under the MIT License, permitting commercial use and modification.
Limitations & Caveats
The data generation process relies heavily on ChatGPT, which may introduce its own biases and blind spots. The benchmark's effectiveness therefore depends on the quality of ChatGPT's generated hallucinations and on the human annotation of the general user queries.