Video-language model integrating image/video encoders for enhanced video understanding
VideoGPT+ enhances video understanding by integrating both image and video encoders, offering a dual-encoding approach for richer spatiotemporal feature extraction. It is designed for researchers and developers working on advanced video-based conversational AI and analysis tasks. The project also introduces a new dataset (VCG+ 112K) and a benchmark (VCGBench-Diverse) to facilitate more robust evaluation.
How It Works
VideoGPT+ processes a video by splitting it into segments and applying adaptive pooling to features from two dedicated encoders: an image encoder for fine-grained spatial detail and a video encoder for temporal context. This dual-encoding strategy aims to capture a more comprehensive understanding of video content than single-encoder approaches. The project builds on the foundations of LLaVA and Video-ChatGPT.
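As a concrete illustration of this dual-encoding step, the sketch below pools per-segment features from an image encoder and a video encoder and concatenates them into one token sequence per segment. It is a minimal sketch, not the project's actual code: the encoder interfaces, tensor shapes, and average-pooling choice are all assumptions made for exposition.

```python
import torch
import torch.nn as nn

class DualEncoderPooler(nn.Module):
    """Toy sketch of dual encoding: pool per-segment features from an
    image encoder (spatial detail) and a video encoder (temporal
    context), then concatenate. Not the actual VideoGPT+ module."""

    def __init__(self, image_encoder: nn.Module, video_encoder: nn.Module,
                 pooled_tokens: int = 16):
        super().__init__()
        self.image_encoder = image_encoder  # assumed: (frames, C, H, W) -> (frames, tokens, dim)
        self.video_encoder = video_encoder  # assumed: (1, frames, C, H, W) -> (1, tokens, dim)
        # Adaptive pooling compresses each stream to a fixed token budget.
        self.pool = nn.AdaptiveAvgPool1d(pooled_tokens)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (num_segments, frames, C, H, W)
        fused = []
        for seg in segments:
            img = self.image_encoder(seg).flatten(0, 1)             # (frames*tokens, dim)
            vid = self.video_encoder(seg.unsqueeze(0)).squeeze(0)   # (tokens, dim)
            # Pool along the token axis (AdaptiveAvgPool1d pools the last dim).
            img = self.pool(img.T).T                                # (pooled_tokens, dim)
            vid = self.pool(vid.T).T                                # (pooled_tokens, dim)
            fused.append(torch.cat([img, vid], dim=0))              # one sequence per segment
        # In the real model these tokens would be projected into the LLM's
        # embedding space before being consumed by the language model.
        return torch.stack(fused)
```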
Quick Start & Requirements
```bash
# Create and activate a clean Python 3.11 environment
conda create --name=videogpt_plus python=3.11
conda activate videogpt_plus

# Clone the repository
git clone https://github.com/mbzuai-oryx/VideoGPT-plus
cd VideoGPT-plus

# Install PyTorch pinned to CUDA 11.8 wheels, plus the pinned transformers release
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.41.0
pip install -r requirements.txt

# Make the repository importable
export PYTHONPATH="./:$PYTHONPATH"

# Build FlashAttention from source (recommended for training)
pip install ninja
git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
```
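After installation, a quick sanity check (a hypothetical snippet, not part of the repository) can confirm that the pinned versions and the FlashAttention build resolved correctly:

```python
# Verify the pinned dependency versions and the FlashAttention build.
import torch
import transformers

print(torch.__version__)           # expect 2.1.2
print(transformers.__version__)    # expect 4.41.0
print(torch.cuda.is_available())   # should be True for the cu118 wheels

import flash_attn                  # raises ImportError if the source build failed
print(flash_attn.__version__)
```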
Evaluation instructions live in eval/README.md, and training scripts are documented in scripts/README.md.
Highlighted Details
Maintenance & Community
The project is associated with the Mohamed bin Zayed University of Artificial Intelligence. Feedback, contributions, and issues can be raised via the GitHub repository.
Licensing & Compatibility
Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This license restricts commercial use and requires derivative works to be shared under the same license.
Limitations & Caveats
The CC BY-NC-SA 4.0 license prohibits commercial use. Because the project builds on LLaVA and Video-ChatGPT, it inherits dependencies and architectural conventions from those codebases. FlashAttention is recommended for training, which improves throughput but requires compiling the library from source (see the quick-start steps above).