Video-language model with pixel-level grounding
PG-Video-LLaVA is a novel video-based Large Multimodal Model (LMM) designed for pixel-level grounding in videos. It targets researchers and developers working with video understanding, enabling precise spatial localization of objects based on user instructions and audio context. The primary benefit is its ability to perform fine-grained object tracking and interaction within video content.
How It Works
PG-Video-LLaVA employs a modular architecture, integrating an off-the-shelf tracker with a custom grounding module. This approach allows it to spatially ground objects in videos by following user prompts. Crucially, it incorporates audio context to enhance video comprehension, making it particularly effective for content with dialogue or spoken information. The model builds upon a strong image-LMM baseline, offering improved conversational abilities over prior video-based models.
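As a mental model of that flow, the sketch below strings the pieces together: an ASR step supplies the audio transcript, the video-LMM produces a textual answer, the grounding module links a referred noun phrase to boxes in key frames, and the tracker propagates those boxes across the clip. Every name in it is an illustrative stand-in; none of it comes from the PG-Video-LLaVA codebase.

```python
# Illustrative sketch of the modular pipeline described above.
# All classes and functions here are hypothetical stand-ins, not the project's API.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates


@dataclass
class GroundedAnswer:
    text: str                 # the video-LMM's textual response
    phrase: str               # noun phrase the user asked to localize
    tracks: List[List[Box]]   # one list of per-frame boxes per tracked object


def transcribe_audio(audio_path: str) -> str:
    """Stand-in for the ASR step that supplies audio context to the LMM."""
    return "placeholder transcript of the spoken content"


def answer_with_lmm(frames: List[object], transcript: str, prompt: str) -> str:
    """Stand-in for the video-LMM: frames + transcript + prompt -> text answer."""
    return "The person in the red jacket waves at the camera."


def ground_phrase(frames: List[object], phrase: str) -> List[Box]:
    """Stand-in for the grounding module: locate the phrase in a key frame."""
    return [(100.0, 80.0, 220.0, 310.0)]


def track_boxes(frames: List[object], seed_boxes: List[Box]) -> List[List[Box]]:
    """Stand-in for the off-the-shelf tracker: propagate seed boxes over all frames."""
    return [[box for _ in frames] for box in seed_boxes]


def pg_video_pipeline(frames: List[object], audio_path: str,
                      prompt: str, phrase: str) -> GroundedAnswer:
    """End-to-end flow: transcript -> LMM answer -> grounding -> tracking."""
    transcript = transcribe_audio(audio_path)
    text = answer_with_lmm(frames, transcript, prompt)
    seeds = ground_phrase(frames, phrase)
    tracks = track_boxes(frames, seeds)
    return GroundedAnswer(text=text, phrase=phrase, tracks=tracks)
```

Calling `pg_video_pipeline(frames, "clip.wav", "What is the person doing?", "person in the red jacket")` would return the answer text plus per-frame boxes for the referred object, which mirrors how the described modules hand data to one another.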
Quick Start & Requirements
The README's exact setup steps are not reproduced here. Because the model builds on LLaVA and Vicuna, expect a CUDA-capable PyTorch environment and access to the corresponding pretrained weights; consult the repository for the tested versions and commands.
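As a rough sanity check before attempting setup, something like the following can confirm the kind of GPU environment a LLaVA/Vicuna-scale model typically needs. The script and the VRAM threshold are illustrative assumptions, not requirements stated by the project.

```python
# Hypothetical pre-flight check for a LLaVA/Vicuna-based model's environment.
# The 16 GB threshold is an assumption, not a documented project requirement.
import torch


def check_environment(min_vram_gb: float = 16.0) -> None:
    if not torch.cuda.is_available():
        raise RuntimeError("A CUDA-capable GPU is expected for inference.")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.1f} GB VRAM)")
    if vram_gb < min_vram_gb:
        print(f"Warning: under {min_vram_gb:.0f} GB VRAM; a 7B-scale backbone may not fit.")


if __name__ == "__main__":
    check_environment()
```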
Highlighted Details
- Pixel-level spatial grounding of objects referred to in user prompts
- Audio context incorporated to improve understanding of videos with dialogue or narration
- Modular design: an image-LMM backbone combined with an off-the-shelf tracker and a grounding module
- Improved conversational ability reported over prior video-based LMMs
Maintenance & Community
The project was released on December 27, 2023, with code and models. Further community engagement details (e.g., Discord, Slack) are not explicitly mentioned in the README.
Licensing & Compatibility
The README does not explicitly state the license. Given its reliance on LLaVA and Vicuna, users should verify compatibility with their respective licenses, which may have restrictions on commercial use.
Limitations & Caveats
The project is a recent release, so its stability, behavior on diverse real-world footage, and long-term maintenance are yet to be established. Quantitative evaluations are reported, but reproducing the full pipeline (image-LMM backbone, grounding module, tracker, and audio processing) may involve practical integration work not covered by those numbers.