LLM leaderboard for hallucination detection in summarization
This repository hosts a public leaderboard that ranks Large Language Models (LLMs) based on their propensity to hallucinate when summarizing short documents. It is targeted at LLM developers, researchers, and users of LLM-powered applications, providing a quantifiable metric for factual consistency in summarization tasks, which is crucial for applications like Retrieval Augmented Generation (RAG).
How It Works
The leaderboard uses Vectara's HHEM-2.1 model, a proprietary hallucination evaluation model, to assess LLM performance. Each LLM is given 1,000 short documents and asked to summarize each one. The evaluation measures factual consistency of the generated summary relative to its source document, rather than overall factual accuracy or summarization quality. This framing makes the evaluation repeatable and scalable, because it requires no knowledge of what was in the LLM's training data.
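Concretely, once each summary has a factual-consistency score, a leaderboard entry reduces to the fraction of summaries flagged as inconsistent. The sketch below illustrates that aggregation step only; the function name, the example scores, and the 0.5 cutoff are illustrative assumptions, not the leaderboard's actual code.

```python
# Aggregating per-summary consistency scores into a hallucination rate.
# Scores are assumed to lie in [0, 1], higher meaning more consistent
# with the source document (as HHEM-style scores are reported).

def hallucination_rate(scores, threshold=0.5):
    """Fraction of summaries whose consistency score falls below threshold."""
    flagged = sum(1 for s in scores if s < threshold)
    return flagged / len(scores)

# Hypothetical scores for five summaries of five source documents.
scores = [0.92, 0.31, 0.88, 0.47, 0.95]
print(f"hallucination rate: {hallucination_rate(scores):.0%}")  # 2 of 5 below 0.5 -> 40%
```

A lower rate ranks better on the leaderboard; the same aggregation works regardless of which consistency model produced the scores.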
Quick Start & Requirements
This repository primarily serves as a results dashboard and does not offer direct installation or execution commands for running the evaluation. The evaluation methodology and the HHEM-2.1-Open model are available separately on Hugging Face and Kaggle.
Maintenance & Community
The leaderboard is updated regularly. The project encourages community involvement and provides links to prior research and related resources. Further details can be found in their blog post "Cut the Bull…. Detecting Hallucinations in Large Language Models."
Licensing & Compatibility
The repository itself does not specify a license. The underlying evaluation model, HHEM-2.1-Open, is available on Hugging Face and Kaggle, where its specific license terms would apply.
Limitations & Caveats
The leaderboard measures hallucination rates specifically within summarization tasks and does not evaluate overall summarization quality or general question-answering capabilities. The evaluation is limited to English language content. The project acknowledges that this is a starting point and plans to expand coverage to citation accuracy, multi-document summarization, and more languages.