hallucination-leaderboard by vectara

LLM leaderboard for hallucination detection in summarization

created 1 year ago
2,632 stars

Top 18.3% on sourcepulse

View on GitHub
Project Summary

This repository hosts a public leaderboard that ranks Large Language Models (LLMs) by their propensity to hallucinate when summarizing short documents. It targets LLM developers, researchers, and users of LLM-powered applications, offering a quantifiable measure of factual consistency in summarization, a property that matters especially for Retrieval Augmented Generation (RAG).

How It Works

The leaderboard uses Vectara's HHEM-2.1, a proprietary hallucination evaluation model, to assess LLM performance. Each LLM is fed 1,000 short documents and asked to summarize them. The evaluation measures the factual consistency of the generated summaries relative to the source documents, rather than overall factual accuracy or summarization quality. This focus makes the evaluation repeatable and scalable, since it does not require knowing what was in each LLM's training data.
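In outline, the evaluation pairs each source document with the summary a model produces for it and scores that pair for factual consistency. The sketch below illustrates that flow; `summarize`, `score_consistency`, and the prompt wording are hypothetical placeholders for the leaderboard's actual model calls and HHEM scoring, and the 0.5 threshold is an assumption, not a published value.

```python
# Sketch of the evaluation flow: summarize each document with the model under
# test, then score each (source, summary) pair for factual consistency.
# `summarize` and `score_consistency` are hypothetical stand-ins for the
# model-under-test API call and the HHEM scorer, respectively.
from typing import Callable, List

def evaluate_model(
    documents: List[str],
    summarize: Callable[[str], str],                  # returns a summary, or "" on refusal
    score_consistency: Callable[[str, str], float],   # consistency score in [0, 1]
    threshold: float = 0.5,                           # assumed cut-off for "hallucinated"
) -> dict:
    results = []
    for doc in documents:
        prompt = (
            "Provide a concise summary of the following passage, "
            "using only facts that appear in the passage.\n\n" + doc
        )
        summary = summarize(prompt)
        if not summary:                               # model declined to answer
            results.append({"answered": False})
            continue
        score = score_consistency(doc, summary)
        results.append({
            "answered": True,
            "summary": summary,
            "consistent": score >= threshold,
        })
    return {"n_documents": len(documents), "results": results}
```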

Quick Start & Requirements

This repository primarily serves as a results dashboard; it does not provide installation or execution commands for running the evaluation yourself. The evaluation methodology and the open-weights HHEM-2.1-Open model are available separately on Hugging Face and Kaggle.
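For experimentation, the open-weights scorer can be pulled from Hugging Face. The snippet below is a minimal sketch assuming the usage pattern shown on the vectara/hallucination_evaluation_model card (AutoModelForSequenceClassification with trust_remote_code and a predict method over premise/hypothesis pairs); consult the model card for the current API.

```python
# Minimal sketch: score (source, summary) pairs with HHEM-2.1-Open.
# Assumes the usage pattern documented on the Hugging Face model card for
# vectara/hallucination_evaluation_model; check the card for the current API.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [
    ("The capital of France is Paris.", "Paris is the capital of France."),
    ("The capital of France is Paris.", "The capital of France is Berlin."),
]
scores = model.predict(pairs)  # one consistency score in [0, 1] per pair
print(scores)                  # higher = more factually consistent
```

Note that the leaderboard itself is scored with the proprietary HHEM-2.1, so results from the open model may not match the published numbers exactly.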

Highlighted Details

  • Evaluates LLMs on hallucination rates using the HHEM-2.1 model.
  • Tests LLMs on summarizing 1,000 short documents, drawn primarily from the CNN/Daily Mail corpus.
  • Reports Hallucination Rate, Factual Consistency Rate, Answer Rate, and Average Summary Length (see the sketch after this list).
  • Documents API integration details for models from OpenAI, Meta (Llama), Cohere, Anthropic, Mistral, Google, Amazon, Microsoft, and others.
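A hedged sketch of how those headline metrics could be derived from per-document results; the 0.5 consistency threshold and word-based length counting are assumptions for illustration, not the leaderboard's published definitions.

```python
# Sketch of deriving the leaderboard's headline metrics from per-document
# results. The threshold and word-based length counting are assumptions.
from typing import List, Optional

def leaderboard_metrics(
    scores: List[Optional[float]],    # HHEM score per document, None if the model refused
    summaries: List[Optional[str]],   # generated summary per document, None if refused
    threshold: float = 0.5,
) -> dict:
    # Keep only documents the model actually summarized (assumes at least one).
    answered = [(s, t) for s, t in zip(scores, summaries) if s is not None]
    consistent = sum(1 for s, _ in answered if s >= threshold)

    factual_consistency_rate = 100.0 * consistent / len(answered)
    return {
        "answer_rate": 100.0 * len(answered) / len(scores),
        "factual_consistency_rate": factual_consistency_rate,
        "hallucination_rate": 100.0 - factual_consistency_rate,
        "avg_summary_length": sum(len(t.split()) for _, t in answered) / len(answered),
    }
```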

Maintenance & Community

The leaderboard is updated regularly. The project encourages community involvement and links to prior research and related resources. Further background is available in Vectara's blog post "Cut the Bull…. Detecting Hallucinations in Large Language Models."

Licensing & Compatibility

The repository itself does not specify a license. The underlying evaluation model, HHEM-2.1-Open, is available on Hugging Face and Kaggle, where its specific license terms would apply.

Limitations & Caveats

The leaderboard measures hallucination rates specifically within summarization tasks and does not evaluate overall summarization quality or general question-answering capabilities. The evaluation is limited to English-language content. The project acknowledges that this is a starting point and plans to expand coverage to citation accuracy, multi-document summarization, and more languages.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 8
  • Issues (30d): 2

Star History

371 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Travis Fischer (founder of Agentic).

HaluEval by RUCAIBox

Benchmark dataset for LLM hallucination evaluation

created 2 years ago, updated 1 year ago
497 stars

Top 0.2% on sourcepulse