Woodpecker by VITA-MLLM

Training-free method for correcting hallucinations in multimodal LLMs

Created 2 years ago
638 stars

Top 52.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

Woodpecker addresses the critical issue of hallucination in Multimodal Large Language Models (MLLMs), where generated text contradicts image content. It offers a training-free, post-hoc correction method for researchers and developers working with MLLMs, aiming to improve the factual accuracy and reliability of multimodal outputs.

How It Works

Woodpecker employs a five-stage pipeline: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. This modular, post-remedy approach allows it to be easily integrated with various MLLMs without retraining. The staged process also provides interpretability by exposing intermediate outputs.
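For orientation, here is a minimal Python sketch of how the five stages chain together. Every function name and signature below is a hypothetical placeholder rather than the repository's actual API (the real entry point is inference.py); only the stage order follows the paper.

# Minimal sketch of Woodpecker's five-stage correction pipeline.
# All helper names are hypothetical placeholders, not the repo's real API.

def extract_key_concepts(answer: str) -> list[str]:
    """Stage 1: pull the main objects mentioned in the MLLM's answer."""
    ...

def formulate_questions(answer: str, concepts: list[str]) -> list[str]:
    """Stage 2: turn each concept into verification questions
    (e.g. object existence, counts, attributes)."""
    ...

def validate_visual_knowledge(image_path: str, questions: list[str]) -> dict:
    """Stage 3: answer the questions against the image, e.g. with an
    open-set detector such as GroundingDINO plus a VQA model."""
    ...

def generate_visual_claims(visual_facts: dict) -> str:
    """Stage 4: convert the validated facts into structured claims
    about the image (a small visual knowledge base)."""
    ...

def correct_hallucinations(answer: str, claims: str) -> str:
    """Stage 5: rewrite the original answer so it is consistent
    with the visual claims, typically via an LLM prompt."""
    ...

def woodpecker_correct(image_path: str, query: str, answer: str) -> str:
    """Run one post-hoc correction pass over an (image, query, answer) triple."""
    concepts = extract_key_concepts(answer)
    questions = formulate_questions(answer, concepts)
    visual_facts = validate_visual_knowledge(image_path, questions)
    claims = generate_visual_claims(visual_facts)
    return correct_hallucinations(answer, claims)

Each intermediate value (concepts, questions, visual facts, claims) corresponds to one of the interpretable intermediate outputs mentioned above.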

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.10; spaCy with the en_core_web_lg, en_core_web_md, and en_core_web_sm models (see the sketch after this list); and GroundingDINO.
  • Usage: Run inference via python inference.py --image-path ... --query ... --text .... Demo setup requires modifying gradio_demo.py and running CUDA_VISIBLE_DEVICES=0,1 python gradio_demo.py.
  • Links: arXiv Paper, Demo, GroundingDINO, spaCy.
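The spaCy models are normally fetched with python -m spacy download <name>. As a convenience, the small helper below (not part of the repository) checks for the three models and downloads any that are missing using spaCy's public API.

# Check that the spaCy models Woodpecker expects are installed, and fetch any
# that are missing. Assumes spaCy itself was installed via requirements.txt.
import spacy
from spacy.cli import download

REQUIRED_MODELS = ["en_core_web_lg", "en_core_web_md", "en_core_web_sm"]

for name in REQUIRED_MODELS:
    try:
        spacy.load(name)   # raises OSError if the model is absent
        print(f"{name}: OK")
    except OSError:
        print(f"{name}: missing, downloading...")
        download(name)     # equivalent to `python -m spacy download <name>`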

Highlighted Details

  • Achieves significant accuracy improvements on the POPE benchmark (30.66% / 24.33% over the MiniGPT-4 / mPLUG-Owl baselines).
  • Evaluated on LLaVA, mPLUG-Owl, Otter, and MiniGPT-4.
  • Proposes new open-ended evaluation metrics (accuracy, detailedness) judged by GPT-4V (see the sketch after this list).
  • Offers interpretability through intermediate outputs.
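The open-ended metric is prompt-based: a vision-capable GPT model sees the image, the question, and an answer, then returns an accuracy and a detailedness score. The snippet below is a rough, hypothetical illustration of that idea using the OpenAI Python SDK; the prompt, scale, and model name are assumptions, not the authors' exact evaluation setup.

# Rough illustration of GPT-4V-style open-ended scoring (accuracy, detailedness).
# The prompt, 0-10 scale, and model choice are hypothetical simplifications.
import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def score_answer(image_path: str, question: str, answer: str) -> str:
    """Ask a vision-capable GPT model to rate an answer's accuracy and detailedness."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        f"Question about the image: {question}\n"
        f"Model answer: {answer}\n"
        "Rate the answer on two axes from 0 to 10 and briefly justify each score:\n"
        "1) accuracy: does the answer match what the image actually shows?\n"
        "2) detailedness: how rich and specific is the answer?"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; the paper used GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content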

Maintenance & Community

The project acknowledges contributions from mPLUG-Owl, GroundingDINO, BLIP-2, and LLaMA-Adapter. A contact email (bradyfu24@gmail.com) and a WeChat ID (xjtupanda) are provided for questions.

Licensing & Compatibility

The repository's license is not explicitly stated in the README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README does not specify a license, which may hinder commercial adoption. The pipeline also depends on external models such as GroundingDINO, so correction quality is tied to the quality of those dependencies.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake), Douwe Kiela (Cofounder of Contextual AI), and 1 more.

lens by ContextualAI

0.3%
353
Vision-language research paper using LLMs
Created 2 years ago
Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0%
463
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Max Howell (Author of Homebrew), Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), and 1 more.

big-sleep by lucidrains

0%
3k
CLI tool for text-to-image generation
Created 4 years ago
Updated 3 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Luis Capelo (Cofounder of Lightning AI).

GroundingDINO by IDEA-Research

0.5%
9k
Object detection via grounded pre-training research paper
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Elvis Saravia (Founder of DAIR.AI).

DeepSeek-VL2 by deepseek-ai

0.1%
5k
MoE vision-language model for multimodal understanding
Created 9 months ago
Updated 6 months ago