ViP-LLaVA by WisconsinAIVision

Multimodal model for understanding visual prompts

created 1 year ago
327 stars

Top 84.6% on sourcepulse

View on GitHub
Project Summary

ViP-LLaVA enables large multimodal models to understand arbitrary visual prompts by directly overlaying them onto images during training. This approach allows for more intuitive and flexible visual instruction following, benefiting researchers and developers working with multimodal AI.

How It Works

ViP-LLaVA integrates visual prompts (e.g., bounding boxes, masks) directly into the image input during the visual instruction tuning phase. This method allows the model to learn to associate specific visual regions with textual instructions without requiring complex prompt engineering or separate processing steps. The architecture builds upon LLaVA, leveraging its multimodal capabilities and extending them with this novel prompt integration technique.
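
The idea can be illustrated with a minimal sketch in plain PIL (not the project's own annotation code): the visual prompt is rasterized directly into the image pixels, and the accompanying instruction refers to the marker in natural language. File names, coordinates, and marker style here are illustrative.

```python
# Sketch: overlay a visual prompt (a red bounding box) onto the image pixels.
# ViP-LLaVA trains on images annotated in this spirit (boxes, arrows, scribbles,
# masks), so no separate region encoder or prompt-specific branch is required.
from PIL import Image, ImageDraw

image = Image.open("example.jpg").convert("RGB")  # illustrative input image
draw = ImageDraw.Draw(image)

# Hypothetical region of interest, (x0, y0, x1, y1) in pixel coordinates.
box = (120, 80, 340, 260)
draw.rectangle(box, outline=(255, 0, 0), width=4)

image.save("example_with_prompt.jpg")
# The annotated image is then paired with an instruction that references the
# marker, e.g. "What is the object inside the red bounding box?"
```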

Quick Start & Requirements

  • Install: Clone the repository and install via pip install -e . (with .[train] for training).
  • Prerequisites: Python 3.10, ideally in a Conda environment; flash-attn is recommended for training.
  • Demo: A Gradio web UI is available; it requires launching a controller, a model worker, and the web server.
  • Resources: Model weights are available on Hugging Face (e.g., mucai/vip-llava-7b). Quantized 4-bit/8-bit versions cut memory requirements, enabling inference on GPUs with as little as 8 GB of VRAM (see the sketch after this list).
  • Docs: Project Page, Demo, Model Zoo, Paper, ViP-Bench.
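
A minimal inference sketch, using the Hugging Face transformers integration noted under Maintenance & Community below: it loads a ViP-LLaVA checkpoint in 4-bit and runs one query against an image that already carries a drawn visual prompt. The llava-hf/vip-llava-7b-hf checkpoint ID and the prompt template are assumptions; check the Model Zoo and model card for the exact values.

```python
# Hedged sketch: 4-bit quantized inference through Hugging Face transformers.
# The checkpoint ID and prompt template are assumed; verify against the model card.
import torch
from PIL import Image
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    VipLlavaForConditionalGeneration,
)

model_id = "llava-hf/vip-llava-7b-hf"  # assumed transformers-format checkpoint

# 4-bit quantization keeps the 7B model within roughly 8 GB of VRAM.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

# Image with the visual prompt (e.g., a red box) already drawn onto it.
image = Image.open("example_with_prompt.jpg")
prompt = "USER: <image>\nWhat is the object inside the red bounding box? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

For full-precision inference on a larger GPU, drop quantization_config and load the model with torch_dtype=torch.float16 instead; the quantized path trades a small amount of accuracy for memory.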

Highlighted Details

  • CVPR 2024 accepted paper.
  • Introduces ViP-Bench, a zero-shot region-level benchmark for multimodal models.
  • Supports Llama-3-8B and Phi-3-mini-3.8B backbones.
  • CLI inference supports bounding box prompts.

Maintenance & Community

  • Active development with recent updates (April 2024).
  • Integrated into the official Hugging Face transformers documentation.

Licensing & Compatibility

  • License: Research use only. Data is CC BY NC 4.0 (non-commercial). Models are restricted by the licenses of their base models (LLaMA, Vicuna, GPT-4).
  • Compatibility: Not suitable for commercial applications due to non-commercial clauses.

Limitations & Caveats

The project's data and checkpoints are strictly licensed for research purposes only, prohibiting commercial use. Training requires significant GPU resources (e.g., 8x A100 GPUs for full training).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 8 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

  • Multimodal assistant with GPT-4 level capabilities
  • Top 0.2% on sourcepulse, 23k stars
  • Created 2 years ago, updated 11 months ago