Multimodal model for understanding visual prompts
ViP-LLaVA enables large multimodal models to understand arbitrary visual prompts by directly overlaying them onto images during training. This approach allows for more intuitive and flexible visual instruction following, benefiting researchers and developers working with multimodal AI.
How It Works
ViP-LLaVA integrates visual prompts (e.g., bounding boxes, masks) directly into the image input during the visual instruction tuning phase. This method allows the model to learn to associate specific visual regions with textual instructions without requiring complex prompt engineering or separate processing steps. The architecture builds upon LLaVA, leveraging its multimodal capabilities and extending them with this novel prompt integration technique.
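As an illustration (not code from the repository), the core idea can be sketched with PIL: the visual prompt is rendered straight into the image's pixels, and the instruction simply refers to the drawn marker. The function name and file path below are placeholders.

```python
# Illustrative sketch only: render a visual prompt (here, a red bounding box)
# directly into the image pixels, mirroring how ViP-LLaVA consumes prompts.
from PIL import Image, ImageDraw


def overlay_box_prompt(image_path, box, color="red", width=4):
    """Return a copy of the image with a bounding-box marker drawn on it."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box, outline=color, width=width)  # the prompt lives in pixel space, not in extra tokens
    return image


# The annotated image is then paired with text that refers to the marker,
# e.g. "What is the object inside the red rectangle?"
annotated = overlay_box_prompt("street_scene.jpg", box=(40, 60, 220, 300))
```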
Quick Start & Requirements
- Install with `pip install -e .` (use `pip install -e ".[train]"` for training dependencies). `flash-attn` is recommended for training.
- Pretrained checkpoints are available on the Hugging Face Hub (e.g., `mucai/vip-llava-7b`). Quantized versions (4-bit/8-bit) reduce VRAM requirements, enabling inference on GPUs with as little as 8GB of VRAM.
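For inference, a minimal sketch along these lines should work, assuming the Hugging Face Transformers port of ViP-LLaVA (`VipLlavaForConditionalGeneration`) and the converted checkpoint `llava-hf/vip-llava-7b-hf`; the prompt template shown is illustrative, so check the model card of the checkpoint you use.

```python
# Minimal 4-bit inference sketch using the Transformers ViP-LLaVA port.
# Assumes the converted checkpoint llava-hf/vip-llava-7b-hf; swap in the
# checkpoint you actually use, and verify its prompt template on the model card.
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, VipLlavaForConditionalGeneration

model_id = "llava-hf/vip-llava-7b-hf"  # assumed HF-format checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # keeps VRAM use in the ~8GB range
    device_map="auto",
)

image = Image.open("annotated_image.jpg")  # image with the visual prompt already drawn on it
prompt = "USER: <image>\nWhat is inside the red box? ASSISTANT:"  # illustrative template

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```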
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project's data and checkpoints are strictly licensed for research purposes only, prohibiting commercial use. Training requires significant GPU resources (e.g., 8x A100 GPUs for full training).