VideoGPT-plus by mbzuai-oryx

Video-language model integrating image/video encoders for enhanced video understanding

created 1 year ago
280 stars

Top 93.9% on sourcepulse

Project Summary

VideoGPT+ enhances video understanding by integrating both image and video encoders, offering a dual-encoding approach for richer spatiotemporal feature extraction. It is designed for researchers and developers working on advanced video-based conversational AI and analysis tasks. The project also introduces a new dataset (VCG+ 112K) and a benchmark (VCGBench-Diverse) to facilitate more robust evaluation.

How It Works

VideoGPT+ processes videos by splitting them into segments and applying adaptive pooling to features extracted from a dedicated image encoder (for fine-grained spatial detail) and a video encoder (for temporal context). This dual-encoding strategy aims to capture a more comprehensive understanding of video content than single-encoder approaches. The project builds on existing architectures such as LLaVA and Video-ChatGPT.
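
As a rough sketch of this dual-encoding pipeline (the encoder interfaces, segment count, pooling grid, and shared feature dimension below are assumptions for illustration, not the project's actual API):

    import torch
    import torch.nn.functional as F

    def encode_video(frames, image_encoder, video_encoder, num_segments=4, pool_hw=8):
        """Hypothetical sketch: segment the video, encode each segment with both
        encoders, adaptively pool, and flatten into visual tokens."""
        segment_tokens = []
        # frames: (T, C, H, W) tensor of uniformly sampled video frames
        for seg in torch.chunk(frames, num_segments, dim=0):
            img_feat = image_encoder(seg)                          # per-frame spatial maps (T_seg, D, h, w)
            vid_feat = video_encoder(seg.unsqueeze(0)).squeeze(0)  # clip-level maps (D, t, h, w)

            # Adaptive pooling reduces both feature maps to a fixed spatial grid,
            # keeping the number of visual tokens passed to the LLM manageable.
            img_feat = F.adaptive_avg_pool2d(img_feat, pool_hw)                # (T_seg, D, 8, 8)
            vid_feat = F.adaptive_avg_pool3d(vid_feat, (1, pool_hw, pool_hw))  # (D, 1, 8, 8)

            # Flatten to token sequences; assumes both encoders share feature dim D
            # (in practice, separate projection layers map each into the LLM space).
            img_tokens = img_feat.flatten(2).transpose(1, 2).reshape(-1, img_feat.shape[1])
            vid_tokens = vid_feat.flatten(1).transpose(0, 1)
            segment_tokens.append(torch.cat([img_tokens, vid_tokens], dim=0))

        # Tokens from all segments are later projected and fed to the language model.
        return torch.cat(segment_tokens, dim=0)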

Quick Start & Requirements

  • Install:
    conda create --name=videogpt_plus python=3.11
    conda activate videogpt_plus
    git clone https://github.com/mbzuai-oryx/VideoGPT-plus
    cd VideoGPT-plus
    pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
    pip install transformers==4.41.0
    pip install -r requirements.txt
    export PYTHONPATH="./:$PYTHONPATH"
    
  • FlashAttention (for training):
    pip install ninja
    git clone https://github.com/HazyResearch/flash-attention.git
    cd flash-attention
    python setup.py install
    
  • Prerequisites: Python 3.11, PyTorch 2.1.2 with CUDA 11.8, Transformers 4.41.0.
  • Resources: A CUDA-capable GPU is required for training and is recommended for inference; a quick post-install sanity check is sketched after this list.
  • Docs: Evaluation instructions are available at eval/README.md, training scripts at scripts/README.md.
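
A minimal post-install sanity check (a sketch; the expected versions simply mirror the pins above):

    import torch
    import transformers

    # Confirm the pinned versions and that the CUDA 11.8 wheels can see a GPU.
    print("torch:", torch.__version__)                # expected 2.1.2+cu118
    print("transformers:", transformers.__version__)  # expected 4.41.0
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))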

Highlighted Details

  • Integrates image and video encoders for enhanced video understanding.
  • Introduces VCG+ 112K dataset for improved instruction tuning.
  • Proposes VCGBench-Diverse, a benchmark with 4,354 QA pairs across 18 categories.
  • Mobile-VideoGPT variant offers 2x higher throughput.

Maintenance & Community

The project is associated with the Mohamed bin Zayed University of Artificial Intelligence. Feedback, contributions, and issues can be raised via the GitHub repository.

Licensing & Compatibility

Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This license restricts commercial use and requires derivative works to be shared under the same license.

Limitations & Caveats

The CC BY-NC-SA 4.0 license prohibits commercial use. The codebase builds on LLaVA and Video-ChatGPT, so it inherits their dependencies and architectural choices. FlashAttention is recommended for training, which improves performance but adds an extra build-from-source installation step.

Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 1
  • Issues (30d): 0
  • Star history: 10 stars in the last 90 days
