VLM-Visualizer by zjysteven

Visualizing attention in vision-language models

Created 1 year ago
251 stars

Top 99.8% on SourcePulse

View on GitHub
Project Summary

This project provides a tool for visualizing the attention mechanisms of vision-language models (VLMs), enabling users to understand how image inputs influence response generation. It targets researchers and engineers working with VLMs, offering insights into model interpretability and decision-making processes.

How It Works

The core approach combines attention weights from the Large Language Model (LLM) component with those from the Vision Transformer (ViT) encoder. This fusion generates an attention map overlaid on the input image, highlighting image regions most influential for specific generated tokens. This method offers a direct visualization of cross-modal attention flow.
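
The sketch below illustrates this fusion in PyTorch. It is a minimal approximation rather than the repository's code: the tensor shapes, the 24×24 patch grid, and the multiply-then-sum fusion rule are all assumptions.

    import torch
    import torch.nn.functional as F

    def fuse_attention(llm_attn_to_image, vit_attn, grid_size, image_hw):
        """Return an (H, W) heatmap for a single generated token.

        llm_attn_to_image: (num_patches,) attention from the generated token
                           to the projected image tokens.
        vit_attn:          (num_patches, num_patches) ViT patch attention,
                           e.g. obtained via attention rollout.
        """
        # Weight each image token's ViT attention row by how strongly the LLM
        # attended to that token, then sum over tokens.
        fused = (llm_attn_to_image.unsqueeze(-1) * vit_attn).sum(dim=0)  # (num_patches,)
        heat = fused.reshape(1, 1, grid_size, grid_size)
        heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
        heat = heat.squeeze()
        # Normalize to [0, 1] for display.
        return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

    # Toy usage with random tensors; 24x24 patches as in CLIP ViT-L/14 at 336px.
    grid = 24
    heatmap = fuse_attention(torch.rand(grid * grid),
                             torch.rand(grid * grid, grid * grid),
                             grid_size=grid,
                             image_hw=(336, 336))
    print(heatmap.shape)  # torch.Size([336, 336])

The resulting heatmap can then be alpha-blended over the resized input image (e.g., with matplotlib's imshow) to produce the kind of overlay the project generates.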

Quick Start & Requirements

  • Installation: Install compatible torch and torchvision, then run pip install -r requirements.txt.
  • Prerequisites: torch, torchvision.
  • Usage: Refer to the llava_example.ipynb Jupyter notebook for a demonstration; a hedged loading sketch follows this list.
  • Supported Models: Primarily targets LLaVA v1.5; newer versions (e.g., v1.6 / LLaVA-NeXT) require modifications.
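
For orientation, the sketch below shows how a comparable per-step attention dump can be obtained with the Hugging Face LLaVA-1.5 port; the repository itself builds on the original LLaVA codebase, and the model id, prompt template, and image file name here are illustrative assumptions.

    import torch
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"   # assumed HF port of LLaVA v1.5
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    image = Image.open("example.jpg")       # hypothetical local image
    prompt = "USER: <image>\nWhat fruit is on the table? ASSISTANT:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )

    out = model.generate(
        **inputs,
        max_new_tokens=32,
        output_attentions=True,         # keep per-layer attention for every step
        return_dict_in_generate=True,
    )

    # out.attentions holds one tuple per generated token, each containing
    # per-layer (batch, heads, query_len, key_len) tensors; slicing the key
    # dimension at the image-token positions gives the values used for overlays.
    print(len(out.attentions), "generation steps captured")
    print(processor.batch_decode(out.sequences, skip_special_tokens=True)[0])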

Highlighted Details

  • Proof of Concept: Demonstrates how attention shifts toward visual tokens when generating relevant words (e.g., ~45% attention on image tokens for "apple"); a toy computation of this statistic follows this list.
  • Cross-Modal Visualization: Generates image overlays showing attention focus during token generation, correlating visual input with textual output.
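
As a rough illustration of how such a percentage could be computed, the snippet below sums one generated token's attention mass over the key positions that hold image tokens. The head averaging and the token positions are assumptions, not the repository's exact recipe.

    import torch

    def image_attention_fraction(step_attn, image_token_slice):
        """Fraction of one generated token's attention that lands on image tokens.

        step_attn: (heads, key_len) attention for a single generation step,
                   already averaged over layers.
        """
        per_head = step_attn / step_attn.sum(dim=-1, keepdim=True)  # renormalize rows
        frac = per_head[:, image_token_slice].sum(dim=-1).mean()    # mean over heads
        return frac.item()

    # Toy numbers: 32 heads, 700 keys, image tokens at positions 35..610.
    attn = torch.rand(32, 700)
    print(f"{image_attention_fraction(attn, slice(35, 611)):.1%} of attention on image tokens")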

Maintenance & Community

This project is explicitly marked as a work in progress with open design choices. Contributions and ideas are welcomed via GitHub discussions. The project acknowledges the LLaVA implementation and an attention aggregation repository.

Licensing & Compatibility

The README does not specify a license type or compatibility notes for commercial use.

Limitations & Caveats

The implementation is described as potentially non-rigorous and is currently limited to LLaVA v1.5 models, requiring adaptation for newer versions. Some attention patterns may appear random.

Health Check

  • Last Commit: 8 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

12 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

0.4%
468
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 1 year ago
Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Jeff Hammerbacher (cofounder of Cloudera), and 10 more.

x-transformers by lucidrains

0.2%
6k
Transformer library with extensive experimental features
Created 5 years ago
Updated 2 days ago