VLM-Visualizer by zjysteven

Visualizing attention in vision-language models

Created 1 year ago
251 stars

Top 99.8% on SourcePulse

View on GitHub
Project Summary

This project provides a tool for visualizing the attention mechanisms of vision-language models (VLMs), enabling users to understand how image inputs influence response generation. It targets researchers and engineers working with VLMs, offering insights into model interpretability and decision-making processes.

How It Works

The core approach combines attention weights from the Large Language Model (LLM) component with those from the Vision Transformer (ViT) encoder. This fusion generates an attention map overlaid on the input image, highlighting image regions most influential for specific generated tokens. This method offers a direct visualization of cross-modal attention flow.
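
The sketch below illustrates this fusion in PyTorch. It is a minimal approximation rather than the repository's code: the tensor shapes, the 24×24 patch grid, and the multiply-then-sum fusion rule are all assumptions.

    import torch
    import torch.nn.functional as F

    def fuse_attention(llm_attn_to_image, vit_attn, grid_size, image_hw):
        """Return an (H, W) heatmap for a single generated token.

        llm_attn_to_image: (num_patches,) attention from the generated token
                           to the projected image tokens.
        vit_attn:          (num_patches, num_patches) ViT patch attention,
                           e.g. obtained via attention rollout.
        """
        # Weight each image token's ViT attention row by how strongly the LLM
        # attended to that token, then sum over tokens.
        fused = (llm_attn_to_image.unsqueeze(-1) * vit_attn).sum(dim=0)  # (num_patches,)
        heat = fused.reshape(1, 1, grid_size, grid_size)
        heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
        heat = heat.squeeze()
        # Normalize to [0, 1] for display.
        return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

    # Toy usage with random tensors; 24x24 patches as in CLIP ViT-L/14 at 336px.
    grid = 24
    heatmap = fuse_attention(torch.rand(grid * grid),
                             torch.rand(grid * grid, grid * grid),
                             grid_size=grid,
                             image_hw=(336, 336))
    print(heatmap.shape)  # torch.Size([336, 336])

The resulting heatmap can then be alpha-blended over the resized input image (e.g., with matplotlib's imshow) to produce the kind of overlay the project generates.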

Quick Start & Requirements

  • Installation: Install compatible torch and torchvision, then run pip install -r requirements.txt.
  • Prerequisites: torch, torchvision.
  • Usage: Refer to the llava_example.ipynb Jupyter notebook for a demonstration; a hedged loading sketch follows this list.
  • Supported Models: Primarily targets LLaVA v1.5; newer versions (e.g., v1.6 / LLaVA-NeXT) require modifications.
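
For orientation, the sketch below shows how a comparable per-step attention dump can be obtained with the Hugging Face LLaVA-1.5 port; the repository itself builds on the original LLaVA codebase, and the model id, prompt template, and image file name here are illustrative assumptions.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"   # assumed HF port of LLaVA v1.5
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open("example.jpg")       # hypothetical local image
    prompt = "USER: <image>\nWhat fruit is on the table? ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )

    out = model.generate(
        **inputs,
        max_new_tokens=32,
        output_attentions=True,         # keep per-layer attention for every step
        return_dict_in_generate=True,
    )

    # out.attentions holds one tuple per generated token, each containing
    # per-layer (batch, heads, query_len, key_len) tensors; slicing the key
    # dimension at the image-token positions gives the values used for overlays.
    print(len(out.attentions), "generation steps captured")
    print(processor.batch_decode(out.sequences, skip_special_tokens=True)[0])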

Highlighted Details

  • Proof of Concept: Demonstrates how attention shifts toward visual tokens when generating relevant words (e.g., ~45% attention on image tokens for "apple"); a toy computation of this statistic follows this list.
  • Cross-Modal Visualization: Generates image overlays showing attention focus during token generation, correlating visual input with textual output.
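
As a rough illustration of how such a percentage could be computed, the snippet below sums one generated token's attention mass over the key positions that hold image tokens. The head averaging and the token positions are assumptions, not the repository's exact recipe.

    import torch

    def image_attention_fraction(step_attn, image_token_slice):
        """Fraction of one generated token's attention that lands on image tokens.

        step_attn: (heads, key_len) attention for a single generation step,
                   already averaged over layers.
        """
        per_head = step_attn / step_attn.sum(dim=-1, keepdim=True)  # renormalize rows
        frac = per_head[:, image_token_slice].sum(dim=-1).mean()    # mean over heads
        return frac.item()

    # Toy numbers: 32 heads, 700 keys, image tokens at positions 35..610.
    attn = torch.rand(32, 700)
    print(f"{image_attention_fraction(attn, slice(35, 611)):.1%} of attention on image tokens")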

Maintenance & Community

This project is explicitly marked as a work in progress with open design choices. Contributions and ideas are welcomed via GitHub discussions. The project acknowledges the LLaVA implementation and an attention aggregation repository.

Licensing & Compatibility

The README does not specify a license type or compatibility notes for commercial use.

Limitations & Caveats

The implementation is described as potentially non-rigorous and is currently limited to LLaVA v1.5 models, requiring adaptation for newer versions. Some attention patterns may appear random.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

12 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0.4%
468
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

0.2%
6k
Transformer library with extensive experimental features
Created 5 years ago
Updated 2 days ago