This repository provides implementations of two video understanding models: MiniGPT4-Video for short videos and Goldfish for arbitrarily long videos. Goldfish addresses the difficulty of processing lengthy video content by using a retrieval mechanism over clip-level descriptions, serving both research and practical multimodal applications.
How It Works
Goldfish tackles long video understanding by first retrieving relevant video clips using an efficient mechanism, then processing these clips to generate responses. This approach mitigates the "noise and redundancy challenge" and "memory and computation" constraints of processing entire long videos. MiniGPT4-Video supports this by generating detailed descriptions for video clips, enhancing the retrieval process.
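As a rough picture of that retrieve-then-answer pipeline, the sketch below embeds clip descriptions and the question, then keeps only the best-matching clips. This is an illustrative example, not the repository's code: the sentence-transformers model stands in for the OpenAI or open-source embedders the project supports, and `retrieve_top_k` plus the sample descriptions are hypothetical.

```python
# Illustrative sketch of Goldfish-style retrieval over a long video.
# Assumes clip descriptions have already been generated (e.g. by MiniGPT4-Video).
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_top_k(question, clip_descriptions, k=3):
    """Return indices of the k clip descriptions most similar to the question."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in text embedder
    clip_emb = model.encode(clip_descriptions, normalize_embeddings=True)
    q_emb = model.encode([question], normalize_embeddings=True)[0]
    scores = clip_emb @ q_emb  # cosine similarity, since embeddings are normalized
    return np.argsort(-scores)[:k].tolist()

descriptions = [
    "A chef chops vegetables and explains the recipe.",
    "Two characters argue in a courtroom.",
    "A dog chases a ball across a park.",
]
# Only the retrieved clips (with their descriptions/subtitles) are passed to the
# video-language model, so memory and compute stay bounded regardless of length.
print(retrieve_top_k("What happens during the courtroom scene?", descriptions, k=1))
```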
Quick Start & Requirements
- Install: Clone the repository and set up the environment with `conda env create -f environment.yml`.
- Prerequisites: Checkpoints for MiniGPT4-Video (Llama2 Chat 7B or Mistral 7B backend). Optional: an OpenAI API key for stronger embedding-based retrieval (a key-setup sketch follows this list).
- Demo/Inference: Scripts are provided for running demos and inference with both Goldfish and MiniGPT4-Video, using either the Llama2 or Mistral backend.
- Links: Project Page, arXiv Paper
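If you opt into the optional OpenAI embeddings, the key is commonly supplied through the standard `OPENAI_API_KEY` environment variable. The snippet below is a minimal sketch under that assumption; the repository's own scripts may expect a different mechanism (flag or config file), so check its docs.

```python
# Assumption: the OpenAI client reads the key from the standard OPENAI_API_KEY
# environment variable. Verify the exact mechanism the repo's scripts expect.
import os

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder key
```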
Highlighted Details
- Goldfish achieves 41.78% accuracy on the TVQA-long benchmark, surpassing prior methods by 14.94%.
- MiniGPT4-Video outperforms prior state-of-the-art methods on short-video benchmarks (MSVD, MSRVTT, TGIF, TVQA) by up to 23.59%.
- Supports both Llama2 and Mistral LLM backends.
- Includes comprehensive evaluation scripts and benchmark results for both short and long video understanding tasks.
Maintenance & Community
- The project is associated with ECCV 2024 and a CVPR 2024 workshop.
- Citation details are provided in BibTeX format.
- Acknowledgements mention MiniGPT4 and Video-ChatGPT.
Licensing & Compatibility
- License: BSD 3-Clause License.
- Compatibility: Based on MiniGPT4. Commercial use is generally permitted under the BSD 3-Clause license, but users should verify the licenses of individual dependencies and model weights (e.g., the Llama2 and Mistral checkpoints).
Limitations & Caveats
- The README mentions that some subtitles for evaluation datasets (MSR-VTT, ActivityNet) are generated using Whisper, which may impact performance consistency.
- For best retrieval quality, OpenAI embeddings are recommended; this requires an API key.