Video-language model integrating image/video encoders for enhanced video understanding
VideoGPT+ enhances video understanding by integrating both image and video encoders, offering a dual-encoding approach for richer spatiotemporal feature extraction. It is designed for researchers and developers working on advanced video-based conversational AI and analysis tasks. The project also introduces a new dataset (VCG+ 112K) and a benchmark (VCGBench-Diverse) to facilitate more robust evaluation.
How It Works
VideoGPT+ processes a video by splitting it into segments and applying adaptive pooling to features from two dedicated encoders: an image encoder for fine-grained spatial detail and a video encoder for temporal context. This dual-encoding strategy aims to capture a more comprehensive understanding of video content than single-encoder approaches. The project builds on the foundations of LLaVA and Video-ChatGPT.
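As a concrete illustration of this dual-encoding step, the sketch below pools per-segment features from an image encoder and a video encoder and concatenates them into one token sequence per segment. It is a minimal sketch, not the project's actual code: the encoder interfaces, tensor shapes, and average-pooling choice are all assumptions made for exposition.

```python
import torch
import torch.nn as nn

class DualEncoderPooler(nn.Module):
    """Toy sketch of dual encoding: pool per-segment features from an
    image encoder (spatial detail) and a video encoder (temporal
    context), then concatenate. Not the actual VideoGPT+ module."""

    def __init__(self, image_encoder: nn.Module, video_encoder: nn.Module,
                 pooled_tokens: int = 16):
        super().__init__()
        self.image_encoder = image_encoder  # assumed: (frames, C, H, W) -> (frames, tokens, dim)
        self.video_encoder = video_encoder  # assumed: (1, frames, C, H, W) -> (1, tokens, dim)
        # Adaptive pooling compresses each stream to a fixed token budget.
        self.pool = nn.AdaptiveAvgPool1d(pooled_tokens)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (num_segments, frames, C, H, W)
        fused = []
        for seg in segments:
            img = self.image_encoder(seg).flatten(0, 1)             # (frames*tokens, dim)
            vid = self.video_encoder(seg.unsqueeze(0)).squeeze(0)   # (tokens, dim)
            # Pool along the token axis (AdaptiveAvgPool1d pools the last dim).
            img = self.pool(img.T).T                                # (pooled_tokens, dim)
            vid = self.pool(vid.T).T                                # (pooled_tokens, dim)
            fused.append(torch.cat([img, vid], dim=0))              # one sequence per segment
        # In the real model these tokens would be projected into the LLM's
        # embedding space before being consumed by the language model.
        return torch.stack(fused)
```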
Quick Start & Requirements
```bash
# Create and activate a clean Python 3.11 environment
conda create --name=videogpt_plus python=3.11
conda activate videogpt_plus

# Clone the repository
git clone https://github.com/mbzuai-oryx/VideoGPT-plus
cd VideoGPT-plus

# Install PyTorch pinned to CUDA 11.8 wheels, plus the pinned transformers release
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.41.0
pip install -r requirements.txt

# Make the repository importable
export PYTHONPATH="./:$PYTHONPATH"

# Build FlashAttention from source (recommended for training)
pip install ninja
git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
```
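After installation, a quick sanity check (a hypothetical snippet, not part of the repository) can confirm that the pinned versions and the FlashAttention build resolved correctly:

```python
# Verify the pinned dependency versions and the FlashAttention build.
import torch
import transformers

print(torch.__version__)           # expect 2.1.2
print(transformers.__version__)    # expect 4.41.0
print(torch.cuda.is_available())   # should be True for the cu118 wheels

import flash_attn                  # raises ImportError if the source build failed
print(flash_attn.__version__)
```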
Evaluation instructions live in eval/README.md, and training scripts are documented in scripts/README.md.
Highlighted Details
Maintenance & Community
The project is associated with the Mohamed bin Zayed University of Artificial Intelligence. Feedback, contributions, and issues can be raised via the GitHub repository.
Licensing & Compatibility
Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This license restricts commercial use and requires derivative works to be shared under the same license.
Limitations & Caveats
The CC BY-NC-SA 4.0 license prohibits commercial use. Because the project builds on LLaVA and Video-ChatGPT, it inherits dependencies and architectural conventions from those codebases. FlashAttention is recommended for training, which improves throughput but requires compiling the library from source (see the quick-start steps above).