Video-ChatGPT by mbzuai-oryx

Video conversation model for detailed video understanding (ACL 2024 paper)

created 2 years ago
1,411 stars

Top 29.4% on sourcepulse

View on GitHub
Project Summary

Video-ChatGPT is a video conversation model designed for detailed video understanding, enabling users to engage in meaningful dialogue about video content. It targets researchers and developers working with multimodal AI, offering a novel approach that combines Large Language Models (LLMs) with a spatiotemporal visual encoder, and introduces a rigorous quantitative evaluation framework for video-based conversational models.

How It Works

The model couples a pre-trained visual encoder, adapted to produce spatiotemporal video representations, with an LLM. Grounding the language model in these video-level features lets it generate detailed, contextually relevant responses about the video. The project also introduces a new quantitative evaluation framework for rigorously benchmarking video-conversation models.
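
As a rough illustration of how a frame-level image encoder can be adapted into a spatiotemporal video representation, the sketch below averages per-frame patch features along the temporal and spatial axes and projects the concatenated tokens into the language model's embedding space. The class name, feature dimensions, and single linear projection are illustrative assumptions, not code taken from the repository.

```python
import torch
import torch.nn as nn

class SpatioTemporalAdapter(nn.Module):
    """Hypothetical adapter: pool per-frame patch features over time and space,
    then project the pooled tokens into the LLM's embedding dimension."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Assumed single learnable projection into the LLM token space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (T, N, D) = frames x patch tokens x feature dim,
        # e.g. per-frame outputs of a CLIP-style image encoder.
        temporal_tokens = frame_features.mean(dim=0)  # (N, D): average over frames
        spatial_tokens = frame_features.mean(dim=1)   # (T, D): average over patches
        video_tokens = torch.cat([temporal_tokens, spatial_tokens], dim=0)  # (N + T, D)
        return self.proj(video_tokens)                # (N + T, llm_dim) tokens fed to the LLM

# Example: 100 frames, 256 patch tokens per frame, 1024-dim features.
features = torch.randn(100, 256, 1024)
video_tokens = SpatioTemporalAdapter()(features)
print(video_tokens.shape)  # torch.Size([356, 4096])
```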

Quick Start & Requirements

  • Installation: Requires setting up a conda environment (Python 3.10 recommended), cloning the repository, and installing dependencies via pip install -r requirements.txt.
  • Prerequisites: FlashAttention (v1.0.7) is recommended for training.
  • Resources: Links to demo, paper, training code, and datasets are provided.

Highlighted Details

  • Achieves state-of-the-art performance on multiple benchmarks, outperforming models like Video Chat, LLaMA Adapter, and Video LLaMA.
  • Introduces VideoInstruct100K, a dataset of 100,000 high-quality video-instruction pairs.
  • Provides a dedicated quantitative evaluation framework (VCGBench-Diverse) for video-based conversational models.
  • Offers both offline and online demos for immediate interaction.

Maintenance & Community

The project is associated with the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI). Recent updates include Mobile-VideoGPT and VideoGPT+, indicating active development. A semi-automatic video annotation pipeline and the VCGBench-Diverse benchmark have also been released. Issues and questions can be raised via the GitHub repository.

Licensing & Compatibility

Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This license restricts commercial use and requires sharing adaptations under the same terms.

Limitations & Caveats

The CC BY-NC-SA 4.0 license rules out use in commercial applications. While the project is actively developed, with recent updates and benchmarks, users should verify compatibility with their specific use cases and environments.

Health Check

  • Last commit: 4 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 64 stars in the last 90 days
