Multimodal model for understanding visual prompts
ViP-LLaVA enables large multimodal models to understand arbitrary visual prompts by directly overlaying them onto images during training. This approach allows for more intuitive and flexible visual instruction following, benefiting researchers and developers working with multimodal AI.
How It Works
ViP-LLaVA integrates visual prompts (e.g., bounding boxes, masks) directly into the image input during the visual instruction tuning phase. This method allows the model to learn to associate specific visual regions with textual instructions without requiring complex prompt engineering or separate processing steps. The architecture builds upon LLaVA, leveraging its multimodal capabilities and extending them with this novel prompt integration technique.
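As an illustration (not code from the repository), the core idea can be sketched with PIL: the visual prompt is rendered straight into the image's pixels, and the instruction simply refers to the drawn marker. The function name and file path below are placeholders.

```python
# Illustrative sketch only: render a visual prompt (here, a red bounding box)
# directly into the image pixels, mirroring how ViP-LLaVA consumes prompts.
from PIL import Image, ImageDraw


def overlay_box_prompt(image_path, box, color="red", width=4):
    """Return a copy of the image with a bounding-box marker drawn on it."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box, outline=color, width=width)  # the prompt lives in pixel space, not in extra tokens
    return image


# The annotated image is then paired with text that refers to the marker,
# e.g. "What is the object inside the red rectangle?"
annotated = overlay_box_prompt("street_scene.jpg", box=(40, 60, 220, 300))
```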
Quick Start & Requirements
- Install with `pip install -e .` (use `pip install -e ".[train]"` for training dependencies). `flash-attn` is recommended for training.
- Pretrained checkpoints are available on the Hugging Face Hub (e.g., `mucai/vip-llava-7b`). Quantized versions (4-bit/8-bit) reduce VRAM requirements, enabling inference on GPUs with as little as 8GB of VRAM.
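For inference, a minimal sketch along these lines should work, assuming the Hugging Face Transformers port of ViP-LLaVA (`VipLlavaForConditionalGeneration`) and the converted checkpoint `llava-hf/vip-llava-7b-hf`; the prompt template shown is illustrative, so check the model card of the checkpoint you use.

```python
# Minimal 4-bit inference sketch using the Transformers ViP-LLaVA port.
# Assumes the converted checkpoint llava-hf/vip-llava-7b-hf; swap in the
# checkpoint you actually use, and verify its prompt template on the model card.
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, VipLlavaForConditionalGeneration

model_id = "llava-hf/vip-llava-7b-hf"  # assumed HF-format checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # keeps VRAM use in the ~8GB range
    device_map="auto",
)

image = Image.open("annotated_image.jpg")  # image with the visual prompt already drawn on it
prompt = "USER: <image>\nWhat is inside the red box? ASSISTANT:"  # illustrative template

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```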
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project's data and checkpoints are strictly licensed for research purposes only, prohibiting commercial use. Training requires significant GPU resources (e.g., 8x A100 GPUs for full training).