Image-context reasoning benchmark for vision-language models
HallusionBench is a benchmark dataset and evaluation suite designed to diagnose and quantify "language hallucination" and "visual illusion" in large vision-language models (VLMs). It targets researchers and developers working on multimodal AI, providing a structured way to assess model robustness against misleading visual cues and language priors.
How It Works
The benchmark consists of 254 yes/no questions paired with 69 images, split into Visual Dependent (VD) and Visual Supplement (VS) types: VD questions require the image to be answered, while VS questions can be answered without it. Questions are further labeled "Easy" (original images) or "Hard" (edited images), forming question pairs that test answer consistency. The goal is to expose VLMs that either lean too heavily on language priors and ignore the visual context (language hallucination) or misread the visual input itself (visual illusion), producing confident but wrong answers.
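To make the pairing concrete, the sketch below shows how question-pair consistency could be scored. It is an illustrative sketch only, not the repository's evaluation.py: the field names (category, set_id, question_id, gt_answer, model_prediction) are assumptions about the JSON layout, and the pairing key may differ from the official grouping.

import json
from collections import defaultdict

# Illustrative sketch only; field names are assumed, not the official schema.
# HallusionBench_result.json is the prediction file produced per the Quick Start below.
with open("HallusionBench_result.json") as f:
    entries = json.load(f)

pairs = defaultdict(list)
for e in entries:
    # Group the "Easy" (original image) and "Hard" (edited image) variants
    # of the same underlying question into one pair.
    key = (e["category"], e["set_id"], e["question_id"])
    correct = e["model_prediction"].strip().lower() == e["gt_answer"].strip().lower()
    pairs[key].append(correct)

# A pair counts as correct only if every variant is answered correctly, which
# penalizes models that flip their yes/no answer when the image is edited.
pair_accuracy = sum(all(v) for v in pairs.values()) / len(pairs)
print(f"question-pair accuracy: {pair_accuracy:.3f}")

Scoring at the pair level rather than per question is what surfaces inconsistency: a model that ignores the image can still get half of each pair right, but rarely both.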
Quick Start & Requirements
1. Clone the repository: git clone https://github.com/tianyi-lab/HallusionBench.git
2. Download hallusion_bench.zip and unzip it in the repository root.
3. Run your model on the questions in HallusionBench.json, saving its predictions to HallusionBench_result.json (a minimal sketch follows this list).
4. Score the predictions with python evaluation.py.
5. To reproduce the GPT-4V baseline, run gpt4v_benchmark.py or download the pre-computed results.
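Step 3 is the only model-specific part. The sketch below assumes each entry in HallusionBench.json is a dict and that evaluation.py reads the model's answer from a model_prediction field; that key, the question and filename fields, and my_vlm_answer are assumptions or hypothetical stand-ins, so check evaluation.py for the exact schema.

import json

def my_vlm_answer(question, image_path):
    # Hypothetical stand-in: call your VLM here and return "yes" or "no".
    return "yes"

with open("HallusionBench.json") as f:
    entries = json.load(f)

for e in entries:
    # "question" and "filename" are assumed field names; the image path can be
    # empty for Visual Supplement questions asked without an image.
    e["model_prediction"] = my_vlm_answer(e.get("question"), e.get("filename"))

with open("HallusionBench_result.json", "w") as f:
    json.dump(entries, f, indent=2)

Once HallusionBench_result.json is written, step 4 (python evaluation.py) scores it.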
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The benchmark focuses on yes/no questions, which may limit the granularity of VLM response analysis. While GPT-4V performance is reported, reproducing exact results may depend on specific API versions or evaluation setups.