VCD by DAMO-NLP-SG

Training-free method for mitigating hallucinations in LVLMs during decoding

created 1 year ago
301 stars

Top 89.6% on sourcepulse

Project Summary

This repository provides Visual Contrastive Decoding (VCD), a training-free method to reduce object hallucinations in Large Vision-Language Models (LVLMs). It's designed for researchers and developers working with LVLMs who need to improve the factual accuracy of generated text without retraining models. VCD offers a simple integration to enhance existing LVLM pipelines.

How It Works

VCD operates by contrasting the output probability distributions produced from the original visual input and from a deliberately distorted copy of it. The core idea is to form a new decoding distribution that down-weights tokens that remain likely even when the image is distorted (i.e., tokens driven by language priors and statistical bias rather than by the image) while up-weighting tokens grounded in the clean visual input. This mitigates the over-reliance on statistical biases and unimodal priors that the authors identify as key causes of object hallucination.
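The contrastive step can be sketched in a few lines. This is a minimal numpy reconstruction of the formulation described in the paper: the contrasted logits are `(1 + alpha) * logits(v) - alpha * logits(v')`, followed by an adaptive plausibility constraint that masks any token whose probability under the original image falls below `beta` times the top token's probability. Function and variable names here are illustrative, not the repository's API.

```python
import numpy as np

def vcd_logits(logits_orig, logits_distorted, alpha=1.0, beta=0.1):
    """Illustrative sketch of Visual Contrastive Decoding (not the repo's code).

    logits_orig:      next-token logits given the original image v
    logits_distorted: next-token logits given the distorted image v'
    """
    logits_orig = np.asarray(logits_orig, dtype=float)
    logits_distorted = np.asarray(logits_distorted, dtype=float)

    # Contrast the two views: amplify what the clean image supports,
    # subtract what survives even under distortion (language priors).
    contrast = (1.0 + alpha) * logits_orig - alpha * logits_distorted

    # Adaptive plausibility constraint: tokens that are implausible under
    # the ORIGINAL image are masked so contrasting cannot promote them.
    probs = np.exp(logits_orig - logits_orig.max())
    probs /= probs.sum()
    contrast[probs < beta * probs.max()] = -np.inf
    return contrast
```

Sampling (or greedy decoding) then proceeds from a softmax over the returned logits; a larger `alpha` strengthens the contrast, while a larger `beta` keeps decoding closer to the original distribution.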

Quick Start & Requirements

  • Install:
    conda create -yn vcd python=3.9
    conda activate vcd
    git clone https://github.com/DAMO-NLP-SG/VCD.git
    cd VCD
    pip install -r requirements.txt
    
  • Prerequisites: Python 3.9, Conda environment. Specific LVLM integration requires models like LLaVA, InstructBLIP, or Qwen-VL.
  • Integration: Requires modifying the model's generation script to call vcd_utils.vcd_sample.evolve_vcd_sampling() and to pass the images_cd, cd_alpha, and cd_beta parameters to model.generate.
  • Resources: Requires GPU for inference.
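The distorted view passed as images_cd is produced by adding diffusion-style Gaussian noise to the original image. Below is a self-contained numpy sketch of that distortion under a linear beta schedule; the schedule, defaults, and function signature are assumptions for illustration, and the repository's own noise utility may differ.

```python
import numpy as np

def add_diffusion_noise(image, noise_step=500, total_steps=1000, seed=0):
    """Distort an image the diffusion-forward-process way (illustrative sketch).

    q(x_t | x_0) = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    using a linear beta schedule (an assumption, not the repo's exact choice).
    Larger noise_step -> heavier distortion.
    """
    betas = np.linspace(1e-4, 0.02, total_steps)
    alpha_bar = np.cumprod(1.0 - betas)[noise_step - 1]
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(np.shape(image))
    return np.sqrt(alpha_bar) * np.asarray(image, dtype=float) + np.sqrt(1.0 - alpha_bar) * eps
```

In an actual integration, the noisy tensor would be supplied alongside the clean one, roughly as `model.generate(..., images_cd=noisy, cd_alpha=1.0, cd_beta=0.1)` after calling evolve_vcd_sampling(); exact values are the kind of hyperparameters the repo exposes rather than fixed recommendations.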

Highlighted Details

  • Selected as a Poster Highlight at CVPR 2024.
  • Demonstrated significant reduction in object hallucinations across various LVLM families (e.g., LLaVA, InstructBLIP, Qwen-VL) on benchmarks like POPE.
  • Enhances general LVLM capabilities, including perception and recognition, without compromising accuracy.
  • Achieves improved GPT-4V-aided evaluation scores for accuracy and detailedness in open-ended generation.

Maintenance & Community

The project is associated with DAMO-NLP-SG. The paper is available on arXiv. Related projects include Contrastive Decoding, InstructBLIP, Qwen-VL, and LLaVA 1.5.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The integration requires modifying existing model generation code, which may introduce complexity.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 32 stars in the last 90 days
