HallusionBench by tianyi-lab

Image-context reasoning benchmark for vision-language models

created 1 year ago
293 stars

Top 91.2% on sourcepulse

Project Summary

HallusionBench is a benchmark dataset and evaluation suite designed to diagnose and quantify "language hallucination" and "visual illusion" in large vision-language models (VLMs). It targets researchers and developers working on multimodal AI, providing a structured way to assess model robustness against misleading visual cues and language priors.

How It Works

The benchmark consists of 254 yes/no questions paired with 69 images, split into two categories: Visual Dependent (VD) questions, which cannot be answered without the image, and Visual Supplement (VS) questions, which can be answered from language alone but are supplemented by the image. Questions come in "Easy" (original image) and "Hard" (subtly edited image) variants, forming pairs that test answer consistency. The core idea is to expose VLMs that either lean too heavily on language priors and ignore the visual context, or misread the visual information outright; both failure modes produce confidently wrong answers.
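The pairing logic can be sketched as follows. This is a minimal illustration, not the benchmark's actual code; the record fields (`figure_id`, `question_id`, `difficulty`, `prediction`, `answer`) are hypothetical names standing in for whatever schema the dataset uses:

```python
# Sketch of question-pair consistency scoring: a pair counts as correct
# only if the model answers BOTH the easy and hard variants correctly.
# All field names here are illustrative assumptions.
from collections import defaultdict

def pair_consistency(records):
    pairs = defaultdict(dict)
    for r in records:
        pairs[(r["figure_id"], r["question_id"])][r["difficulty"]] = (
            r["prediction"] == r["answer"]
        )
    # Keep only complete easy/hard pairs, then require both to be right.
    complete = [all(p.values()) for p in pairs.values() if len(p) == 2]
    return sum(complete) / len(complete) if complete else 0.0

records = [
    {"figure_id": 1, "question_id": 1, "difficulty": "easy",
     "prediction": "yes", "answer": "yes"},
    {"figure_id": 1, "question_id": 1, "difficulty": "hard",
     "prediction": "yes", "answer": "no"},  # model ignored the image edit
]
print(pair_consistency(records))  # 0.0: the pair fails on the edited image
```

A model that answers from language priors alone gives the same answer to both variants, so it necessarily fails one side of every pair whose edit flips the ground truth.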

Quick Start & Requirements

  • Install by cloning the repository: git clone https://github.com/tianyi-lab/HallusionBench.git
  • Download the image archive hallusion_bench.zip and unzip it in the repository root.
  • Evaluation requires running models against HallusionBench.json and saving predictions to HallusionBench_result.json.
  • Evaluation script: python evaluation.py
  • For GPT-4V evaluation, modify gpt4v_benchmark.py or download pre-computed results.
  • Official quick-start and leaderboard: https://github.com/tianyi-lab/HallusionBench
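The run-then-evaluate loop from the steps above can be sketched in Python. Only the two file names (HallusionBench.json and HallusionBench_result.json) come from the repository's instructions; the per-record schema and the `model_prediction` key are assumptions for illustration:

```python
# Sketch: load the benchmark questions, attach model predictions, and
# write the result file that evaluation.py expects. The record fields
# ("question", "filename") and the "model_prediction" key are assumed,
# not taken from the actual dataset schema.
import json

def run_benchmark(model_fn, in_path="HallusionBench.json",
                  out_path="HallusionBench_result.json"):
    with open(in_path) as f:
        records = json.load(f)
    for r in records:
        # model_fn(question, image_path) should return the VLM's answer.
        r["model_prediction"] = model_fn(r.get("question"), r.get("filename"))
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)
    return records
```

After the result file is written, run `python evaluation.py` to score it.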

Highlighted Details

  • Benchmarked models include GPT-4V, LLaVA-1.5, Claude 3, Gemini Pro Vision, and others.
  • Metrics include Accuracy per Question Pair, Figure Accuracy, and individual question accuracies.
  • Includes "Hard" versions of images with subtle edits to challenge model robustness.
  • Paper accepted at CVPR 2024.
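Of the metrics listed above, Figure Accuracy counts a figure as correct only when every question about it is answered correctly. A minimal sketch, assuming a flat list of hypothetical (figure_id, correct) tuples rather than the suite's real data structures:

```python
# Sketch of the Figure Accuracy metric: a figure scores 1 only if ALL
# questions about that figure are answered correctly. The flat
# (figure_id, correct) input format is an assumption for illustration.
from collections import defaultdict

def figure_accuracy(results):
    by_figure = defaultdict(list)
    for figure_id, correct in results:
        by_figure[figure_id].append(correct)
    return sum(all(v) for v in by_figure.values()) / len(by_figure)

results = [(1, True), (1, True), (2, True), (2, False)]
print(figure_accuracy(results))  # 0.5: figure 2 has one wrong answer
```

This is a stricter signal than per-question accuracy: a model that guesses inconsistently across a figure's questions scores near zero even if half its individual answers are right.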

Maintenance & Community

  • The project is associated with the Tianyi Lab.
  • Related work includes papers on automatic benchmark generation (AutoHallusion at EMNLP 2024) and mitigating hallucinations via instruction tuning.
  • Community contributions for failure cases are welcomed.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The benchmark focuses on yes/no questions, which may limit the granularity of VLM response analysis. While GPT-4V performance is reported, reproducing exact results may depend on specific API versions or evaluation setups.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
