hallucination-leaderboard by vectara

LLM leaderboard for hallucination detection in summarization

created 1 year ago
2,632 stars

Top 18.3% on sourcepulse

View on GitHub
Project Summary

This repository hosts a public leaderboard that ranks Large Language Models (LLMs) by their propensity to hallucinate when summarizing short documents. It targets LLM developers, researchers, and users of LLM-powered applications, offering a quantifiable measure of factual consistency in summarization, a property that matters especially for Retrieval Augmented Generation (RAG).

How It Works

The leaderboard uses Vectara's HHEM-2.1, a proprietary hallucination evaluation model, to assess LLM performance. Each LLM is fed 1,000 short documents and asked to summarize them. The evaluation measures the factual consistency of the generated summaries relative to the source documents, rather than overall factual accuracy or summarization quality. This focus makes the evaluation repeatable and scalable, since it does not require knowing what was in each LLM's training data.
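In outline, the evaluation pairs each source document with the summary a model produces for it and scores that pair for factual consistency. The sketch below illustrates that flow; `summarize`, `score_consistency`, and the prompt wording are hypothetical placeholders for the leaderboard's actual model calls and HHEM scoring, and the 0.5 threshold is an assumption, not a published value.

```python
# Sketch of the evaluation flow: summarize each document with the model under
# test, then score each (source, summary) pair for factual consistency.
# `summarize` and `score_consistency` are hypothetical stand-ins for the
# model-under-test API call and the HHEM scorer, respectively.
from typing import Callable, List

def evaluate_model(
    documents: List[str],
    summarize: Callable[[str], str],                  # returns a summary, or "" on refusal
    score_consistency: Callable[[str, str], float],   # consistency score in [0, 1]
    threshold: float = 0.5,                           # assumed cut-off for "hallucinated"
) -> dict:
    results = []
    for doc in documents:
        prompt = (
            "Provide a concise summary of the following passage, "
            "using only facts that appear in the passage.\n\n" + doc
        )
        summary = summarize(prompt)
        if not summary:                               # model declined to answer
            results.append({"answered": False})
            continue
        score = score_consistency(doc, summary)
        results.append({
            "answered": True,
            "summary": summary,
            "consistent": score >= threshold,
        })
    return {"n_documents": len(documents), "results": results}
```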

Quick Start & Requirements

This repository primarily serves as a results dashboard; it does not provide installation or execution commands for running the evaluation yourself. The evaluation methodology and the open-weights HHEM-2.1-Open model are available separately on Hugging Face and Kaggle.
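For experimentation, the open-weights scorer can be pulled from Hugging Face. The snippet below is a minimal sketch assuming the usage pattern shown on the vectara/hallucination_evaluation_model card (AutoModelForSequenceClassification with trust_remote_code and a predict method over premise/hypothesis pairs); consult the model card for the current API.

```python
# Minimal sketch: score (source, summary) pairs with HHEM-2.1-Open.
# Assumes the usage pattern documented on the Hugging Face model card for
# vectara/hallucination_evaluation_model; check the card for the current API.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

pairs = [
    ("The capital of France is Paris.", "Paris is the capital of France."),
    ("The capital of France is Paris.", "The capital of France is Berlin."),
]
scores = model.predict(pairs)  # one consistency score in [0, 1] per pair
print(scores)                  # higher = more factually consistent
```

Note that the leaderboard itself is scored with the proprietary HHEM-2.1, so results from the open model may not match the published numbers exactly.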

Highlighted Details

  • Evaluates LLMs on hallucination rates using the HHEM-2.1 model.
  • Tests LLMs on summarizing 1,000 short documents, drawn primarily from the CNN/Daily Mail corpus.
  • Reports Hallucination Rate, Factual Consistency Rate, Answer Rate, and Average Summary Length (see the sketch after this list).
  • Documents API integration details for models from OpenAI, Meta (Llama), Cohere, Anthropic, Mistral, Google, Amazon, Microsoft, and others.
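A hedged sketch of how those headline metrics could be derived from per-document results; the 0.5 consistency threshold and word-based length counting are assumptions for illustration, not the leaderboard's published definitions.

```python
# Sketch of deriving the leaderboard's headline metrics from per-document
# results. The threshold and word-based length counting are assumptions.
from typing import List, Optional

def leaderboard_metrics(
    scores: List[Optional[float]],    # HHEM score per document, None if the model refused
    summaries: List[Optional[str]],   # generated summary per document, None if refused
    threshold: float = 0.5,
) -> dict:
    # Keep only documents the model actually summarized (assumes at least one).
    answered = [(s, t) for s, t in zip(scores, summaries) if s is not None]
    consistent = sum(1 for s, _ in answered if s >= threshold)

    factual_consistency_rate = 100.0 * consistent / len(answered)
    return {
        "answer_rate": 100.0 * len(answered) / len(scores),
        "factual_consistency_rate": factual_consistency_rate,
        "hallucination_rate": 100.0 - factual_consistency_rate,
        "avg_summary_length": sum(len(t.split()) for _, t in answered) / len(answered),
    }
```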

Maintenance & Community

The leaderboard is updated regularly. The project encourages community involvement and links to prior research and related resources. Further background is available in Vectara's blog post "Cut the Bull…. Detecting Hallucinations in Large Language Models."

Licensing & Compatibility

The repository itself does not specify a license. The underlying evaluation model, HHEM-2.1-Open, is available on Hugging Face and Kaggle, where its specific license terms would apply.

Limitations & Caveats

The leaderboard measures hallucination rates specifically within summarization tasks and does not evaluate overall summarization quality or general question-answering capabilities. The evaluation is limited to English-language content. The project acknowledges that this is a starting point and plans to expand coverage to citation accuracy, multi-document summarization, and more languages.

Health Check

  • Last commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 8
  • Issues (30d): 2

Star History

371 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Travis Fischer (founder of Agentic).

HaluEval by RUCAIBox

Benchmark dataset for LLM hallucination evaluation

created 2 years ago, updated 1 year ago
497 stars

Top 0.2% on sourcepulse