MiniGPT4-video by Vision-CAIR

Video-language model for short and long video understanding

created 1 year ago
627 stars

Top 53.6% on sourcepulse

Project Summary

This repository provides implementations for two advanced video understanding models: MiniGPT4-Video for short videos and Goldfish for arbitrarily long videos. It addresses challenges in processing lengthy video content by employing a retrieval mechanism and offers solutions for both research and practical applications in multimodal AI.

How It Works

Goldfish tackles long video understanding by first retrieving relevant video clips using an efficient mechanism, then processing these clips to generate responses. This approach mitigates the "noise and redundancy challenge" and "memory and computation" constraints of processing entire long videos. MiniGPT4-Video supports this by generating detailed descriptions for video clips, enhancing the retrieval process.
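The retrieval step described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names and toy embeddings are hypothetical, and the real system uses learned clip-description embeddings (optionally from the OpenAI API) rather than hand-written vectors.

```python
# Sketch of Goldfish-style clip retrieval: rank video clips by the cosine
# similarity between the query embedding and each clip-description embedding,
# then keep only the top-k clips for the LLM to answer over.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(query_emb, clip_embs, k=3):
    """Return the indices of the k clips most similar to the query."""
    scored = sorted(
        enumerate(clip_embs),
        key=lambda pair: cosine(query_emb, pair[1]),
        reverse=True,
    )
    return [idx for idx, _ in scored[:k]]

# Toy 2-D embeddings: clip 1 points almost the same way as the query,
# clip 2 is partially aligned, clip 0 is orthogonal.
query = [1.0, 0.0]
clips = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
print(retrieve_top_k(query, clips, k=2))  # → [1, 2]
```

Only the retrieved clips are passed to the language model, which is what keeps memory and computation bounded regardless of total video length.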

Quick Start & Requirements

  • Install: Clone the repository and set up the environment using conda env create -f environment.yml.
  • Prerequisites: Requires checkpoints for MiniGPT4-Video (Llama2 Chat 7B or Mistral 7B). Optional: OpenAI API key for enhanced embedding performance.
  • Demo/Inference: Scripts are provided for running demos and inference with both Goldfish and MiniGPT4-Video, supporting the Llama2 and Mistral backends.
  • Links: Project Page, arXiv Paper

Highlighted Details

  • Goldfish achieves 41.78% accuracy on the TVQA-long benchmark, surpassing prior methods by 14.94%.
  • MiniGPT4-Video outperforms state-of-the-art on short video benchmarks (MSVD, MSRVTT, TGIF, TVQA) by up to 23.59%.
  • Supports both Llama2 and Mistral LLM backends.
  • Includes comprehensive evaluation scripts and benchmark results for both short and long video understanding tasks.

Maintenance & Community

  • The project is associated with ECCV 2024 and CVPR2024W.
  • Citation details are provided in BibTeX format.
  • Acknowledgements mention MiniGPT4 and Video-ChatGPT.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Based on MiniGPT4. Commercial use is generally permitted under BSD 3-Clause, but users should verify specific dependencies.

Limitations & Caveats

  • The README mentions that some subtitles for evaluation datasets (MSR-VTT, ActivityNet) are generated using Whisper, which may impact performance consistency.
  • For optimal performance, using OpenAI embeddings is recommended, requiring an API key.
Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 15 stars in the last 90 days
