HallusionBench by tianyi-lab

Image-context reasoning benchmark for vision-language models

created 1 year ago
293 stars

Top 91.2% on sourcepulse

Project Summary

HallusionBench is a benchmark dataset and evaluation suite designed to diagnose and quantify "language hallucination" and "visual illusion" in large vision-language models (VLMs). It targets researchers and developers working on multimodal AI, providing a structured way to assess model robustness against misleading visual cues and language priors.

How It Works

The benchmark consists of 254 yes/no questions paired with 69 images, split into two categories: Visual Dependent (VD) questions, which cannot be answered without the image, and Visual Supplement (VS) questions, which can be answered from language alone but are supplemented by the image. Questions come in "Easy" (original image) and "Hard" (subtly edited image) variants, forming pairs that test answer consistency. The core idea is to expose VLMs that either lean too heavily on language priors and ignore the visual context, or misread the visual information outright; both failure modes produce confidently wrong answers.
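The pairing logic can be sketched as follows. This is a minimal illustration, not the benchmark's actual code; the record fields (`figure_id`, `question_id`, `difficulty`, `prediction`, `answer`) are hypothetical names standing in for whatever schema the dataset uses:

```python
# Sketch of question-pair consistency scoring: a pair counts as correct
# only if the model answers BOTH the easy and hard variants correctly.
# All field names here are illustrative assumptions.
from collections import defaultdict

def pair_consistency(records):
    pairs = defaultdict(dict)
    for r in records:
        pairs[(r["figure_id"], r["question_id"])][r["difficulty"]] = (
            r["prediction"] == r["answer"]
        )
    # Keep only complete easy/hard pairs, then require both to be right.
    complete = [all(p.values()) for p in pairs.values() if len(p) == 2]
    return sum(complete) / len(complete) if complete else 0.0

records = [
    {"figure_id": 1, "question_id": 1, "difficulty": "easy",
     "prediction": "yes", "answer": "yes"},
    {"figure_id": 1, "question_id": 1, "difficulty": "hard",
     "prediction": "yes", "answer": "no"},  # model ignored the image edit
]
print(pair_consistency(records))  # 0.0: the pair fails on the edited image
```

A model that answers from language priors alone gives the same answer to both variants, so it necessarily fails one side of every pair whose edit flips the ground truth.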

Quick Start & Requirements

  • Install by cloning the repository: git clone https://github.com/tianyi-lab/HallusionBench.git
  • Download the image archive hallusion_bench.zip and unzip it in the repository root.
  • Evaluation requires running models against HallusionBench.json and saving predictions to HallusionBench_result.json.
  • Evaluation script: python evaluation.py
  • For GPT-4V evaluation, modify gpt4v_benchmark.py or download pre-computed results.
  • Official quick-start and leaderboard: https://github.com/tianyi-lab/HallusionBench
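The run-then-evaluate loop from the steps above can be sketched in Python. Only the two file names (HallusionBench.json and HallusionBench_result.json) come from the repository's instructions; the per-record schema and the `model_prediction` key are assumptions for illustration:

```python
# Sketch: load the benchmark questions, attach model predictions, and
# write the result file that evaluation.py expects. The record fields
# ("question", "filename") and the "model_prediction" key are assumed,
# not taken from the actual dataset schema.
import json

def run_benchmark(model_fn, in_path="HallusionBench.json",
                  out_path="HallusionBench_result.json"):
    with open(in_path) as f:
        records = json.load(f)
    for r in records:
        # model_fn(question, image_path) should return the VLM's answer.
        r["model_prediction"] = model_fn(r.get("question"), r.get("filename"))
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)
    return records
```

After the result file is written, run `python evaluation.py` to score it.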

Highlighted Details

  • Benchmarked models include GPT-4V, LLaVA-1.5, Claude 3, Gemini Pro Vision, and others.
  • Metrics include Accuracy per Question Pair, Figure Accuracy, and individual question accuracies.
  • Includes "Hard" versions of images with subtle edits to challenge model robustness.
  • Paper accepted at CVPR 2024.
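Of the metrics listed above, Figure Accuracy counts a figure as correct only when every question about it is answered correctly. A minimal sketch, assuming a flat list of hypothetical (figure_id, correct) tuples rather than the suite's real data structures:

```python
# Sketch of the Figure Accuracy metric: a figure scores 1 only if ALL
# questions about that figure are answered correctly. The flat
# (figure_id, correct) input format is an assumption for illustration.
from collections import defaultdict

def figure_accuracy(results):
    by_figure = defaultdict(list)
    for figure_id, correct in results:
        by_figure[figure_id].append(correct)
    return sum(all(v) for v in by_figure.values()) / len(by_figure)

results = [(1, True), (1, True), (2, True), (2, False)]
print(figure_accuracy(results))  # 0.5: figure 2 has one wrong answer
```

This is a stricter signal than per-question accuracy: a model that guesses inconsistently across a figure's questions scores near zero even if half its individual answers are right.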

Maintenance & Community

  • The project is associated with the Tianyi Lab.
  • Related work includes papers on automatic benchmark generation (AutoHallusion at EMNLP 2024) and mitigating hallucinations via instruction tuning.
  • Community contributions for failure cases are welcomed.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The benchmark focuses on yes/no questions, which may limit the granularity of VLM response analysis. While GPT-4V performance is reported, reproducing exact results may depend on specific API versions or evaluation setups.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 13 stars in the last 90 days
