VCD by DAMO-NLP-SG

Training-free method for mitigating hallucinations in LVLMs during decoding

created 1 year ago
301 stars

Top 89.6% on sourcepulse

Project Summary

This repository provides Visual Contrastive Decoding (VCD), a training-free method to reduce object hallucinations in Large Vision-Language Models (LVLMs). It's designed for researchers and developers working with LVLMs who need to improve the factual accuracy of generated text without retraining models. VCD offers a simple integration to enhance existing LVLM pipelines.

How It Works

VCD operates by contrasting the output probability distributions produced from the original visual input and from a deliberately distorted copy of it. The core idea is to form a new decoding distribution that down-weights tokens that remain likely even when the image is distorted (i.e., tokens driven by language priors and statistical bias rather than by the image) while up-weighting tokens grounded in the clean visual input. This mitigates the over-reliance on statistical biases and unimodal priors that the authors identify as key causes of object hallucination.
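The contrastive step can be sketched in a few lines. This is a minimal numpy reconstruction of the formulation described in the paper: the contrasted logits are `(1 + alpha) * logits(v) - alpha * logits(v')`, followed by an adaptive plausibility constraint that masks any token whose probability under the original image falls below `beta` times the top token's probability. Function and variable names here are illustrative, not the repository's API.

```python
import numpy as np

def vcd_logits(logits_orig, logits_distorted, alpha=1.0, beta=0.1):
    """Illustrative sketch of Visual Contrastive Decoding (not the repo's code).

    logits_orig:      next-token logits given the original image v
    logits_distorted: next-token logits given the distorted image v'
    """
    logits_orig = np.asarray(logits_orig, dtype=float)
    logits_distorted = np.asarray(logits_distorted, dtype=float)

    # Contrast the two views: amplify what the clean image supports,
    # subtract what survives even under distortion (language priors).
    contrast = (1.0 + alpha) * logits_orig - alpha * logits_distorted

    # Adaptive plausibility constraint: tokens that are implausible under
    # the ORIGINAL image are masked so contrasting cannot promote them.
    probs = np.exp(logits_orig - logits_orig.max())
    probs /= probs.sum()
    contrast[probs < beta * probs.max()] = -np.inf
    return contrast
```

Sampling (or greedy decoding) then proceeds from a softmax over the returned logits; a larger `alpha` strengthens the contrast, while a larger `beta` keeps decoding closer to the original distribution.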

Quick Start & Requirements

  • Install:
    conda create -yn vcd python=3.9
    conda activate vcd
    git clone https://github.com/DAMO-NLP-SG/VCD.git
    cd VCD
    pip install -r requirements.txt
    
  • Prerequisites: Python 3.9, Conda environment. Specific LVLM integration requires models like LLaVA, InstructBLIP, or Qwen-VL.
  • Integration: Requires modifying the model's generation script to call vcd_utils.vcd_sample.evolve_vcd_sampling() and to pass the images_cd, cd_alpha, and cd_beta parameters to model.generate.
  • Resources: Requires GPU for inference.
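The distorted view passed as images_cd is produced by adding diffusion-style Gaussian noise to the original image. Below is a self-contained numpy sketch of that distortion under a linear beta schedule; the schedule, defaults, and function signature are assumptions for illustration, and the repository's own noise utility may differ.

```python
import numpy as np

def add_diffusion_noise(image, noise_step=500, total_steps=1000, seed=0):
    """Distort an image the diffusion-forward-process way (illustrative sketch).

    q(x_t | x_0) = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    using a linear beta schedule (an assumption, not the repo's exact choice).
    Larger noise_step -> heavier distortion.
    """
    betas = np.linspace(1e-4, 0.02, total_steps)
    alpha_bar = np.cumprod(1.0 - betas)[noise_step - 1]
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(np.shape(image))
    return np.sqrt(alpha_bar) * np.asarray(image, dtype=float) + np.sqrt(1.0 - alpha_bar) * eps
```

In an actual integration, the noisy tensor would be supplied alongside the clean one, roughly as `model.generate(..., images_cd=noisy, cd_alpha=1.0, cd_beta=0.1)` after calling evolve_vcd_sampling(); exact values are the kind of hyperparameters the repo exposes rather than fixed recommendations.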

Highlighted Details

  • Selected as a Poster Highlight at CVPR 2024.
  • Demonstrated significant reduction in object hallucinations across various LVLM families (e.g., LLaVA, InstructBLIP, Qwen-VL) on benchmarks like POPE.
  • Enhances general LVLM capabilities, including perception and recognition, without compromising accuracy.
  • Achieves improved GPT-4V-aided evaluation scores for accuracy and detailedness in open-ended generation.

Maintenance & Community

The project is associated with DAMO-NLP-SG. The paper is available on arXiv. Related projects include Contrastive Decoding, InstructBLIP, Qwen-VL, and LLaVA 1.5.

Licensing & Compatibility

The repository does not explicitly state a license in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not detail specific limitations, unsupported platforms, or known bugs. The integration requires modifying existing model generation code, which may introduce complexity.

Health Check

  • Last commit: 10 months ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 32 stars in the last 90 days
