Video-LLaVA by mbzuai-oryx

Video-language model with pixel-level grounding

created 1 year ago
257 stars

Top 98.8% on sourcepulse

Project Summary

PG-Video-LLaVA is a novel video-based Large Multimodal Model (LMM) designed for pixel-level grounding in videos. It targets researchers and developers working with video understanding, enabling precise spatial localization of objects based on user instructions and audio context. The primary benefit is its ability to perform fine-grained object tracking and interaction within video content.

How It Works

PG-Video-LLaVA employs a modular architecture, integrating an off-the-shelf tracker with a custom grounding module. This approach allows it to spatially ground objects in videos by following user prompts. Crucially, it incorporates audio context to enhance video comprehension, making it particularly effective for content with dialogue or spoken information. The model builds upon a strong image-LMM baseline, offering improved conversational abilities over prior video-based models.
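The modular flow described above can be illustrated with a minimal sketch. This is not the project's code: `Box`, `build_prompt`, `ground_in_video`, and the toy `detect`/`track` functions are hypothetical names invented for illustration, and the detect-then-track loop is only one plausible way to combine a grounding module with an off-the-shelf tracker.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def build_prompt(question: str, transcript: Optional[str] = None) -> str:
    """Fold an optional audio transcript into the text prompt (assumed format)."""
    if transcript:
        return f"Audio transcript: {transcript}\nQuestion: {question}"
    return f"Question: {question}"


def ground_in_video(
    frames: Sequence[Any],
    phrase: str,
    detect: Callable[[Any, str], Optional[Box]],
    track: Callable[[Any, Box], Optional[Box]],
) -> List[Optional[Box]]:
    """Return one box (or None) per frame for the object named by `phrase`.

    The grounding module (`detect`) localizes the phrase; the tracker
    (`track`) propagates the box. Detection is retried if the track is lost.
    """
    boxes: List[Optional[Box]] = []
    for frame in frames:
        prev = boxes[-1] if boxes else None
        if prev is None:
            boxes.append(detect(frame, phrase))
        else:
            boxes.append(track(frame, prev))
    return boxes


# Toy stand-ins so the sketch runs without any model weights.
def toy_detect(frame: Any, phrase: str) -> Optional[Box]:
    return Box(0, 0, 10, 10) if phrase == "dog" else None


def toy_track(frame: Any, prev: Box) -> Optional[Box]:
    # Pretend the object drifts one pixel to the right per frame.
    return Box(prev.x1 + 1, prev.y1, prev.x2 + 1, prev.y2)


if __name__ == "__main__":
    print(build_prompt("Where is the dog?", "Come here, boy!"))
    print(ground_in_video(["f0", "f1", "f2"], "dog", toy_detect, toy_track))
```

Keeping the detector and tracker as plain callables mirrors the modular design: either component could be swapped without touching the loop.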

Quick Start & Requirements

  • Installation: Refer to the installation instructions linked from the project's GitHub repository.
  • Prerequisites: Likely requires Python, PyTorch, and potentially specific versions of CUDA for GPU acceleration. Detailed requirements are in the linked instructions.
  • Resources: Setup and inference will likely require significant GPU resources and memory, typical for large multimodal models.
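Before following the installation instructions, a quick environment check can confirm whether the likely prerequisites are present. The sketch below is an assumption-based convenience, not part of the project: `check_environment` is a hypothetical helper, and it only reports whether PyTorch is importable and whether it can see a CUDA device. It does not verify the specific versions the project may require.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report Python version, PyTorch presence, and CUDA visibility."""
    report = {
        "python": sys.version.split()[0],
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch
        except ImportError:
            # The package was found but could not be imported (broken install).
            report["torch_installed"] = False
        else:
            report["torch_version"] = torch.__version__
            report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as false, inference would have to run on the CPU, which is unlikely to be practical for a model of this size.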

Highlighted Details

  • First video-based LMM with pixel-level grounding capabilities.
  • Introduces a new benchmark for prompt-based object grounding in videos.
  • Leverages audio context for enhanced video understanding.
  • Uses the open-source Vicuna-13b-v1.5 as the evaluator for generative performance, so that benchmark results are reproducible.

Maintenance & Community

The project was released on December 27, 2023, with code and models. Further community engagement details (e.g., Discord, Slack) are not explicitly mentioned in the README.

Licensing & Compatibility

The README does not explicitly state the license. Given its reliance on LLaVA and Vicuna, users should verify compatibility with their respective licenses, which may have restrictions on commercial use.

Limitations & Caveats

The project's stability, its performance on diverse real-world scenarios, and its long-term maintenance are yet to be established. Detailed quantitative evaluations are provided, but users may still run into practical implementation challenges.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 3 stars in the last 90 days
