MiniGPT4-video by Vision-CAIR

Video-language model for short and long video understanding

Created 1 year ago
628 stars

Top 52.7% on SourcePulse

Project Summary

This repository provides implementations for two advanced video understanding models: MiniGPT4-Video for short videos and Goldfish for arbitrarily long videos. It addresses challenges in processing lengthy video content by employing a retrieval mechanism and offers solutions for both research and practical applications in multimodal AI.

How It Works

Goldfish tackles long video understanding by first retrieving relevant video clips using an efficient mechanism, then processing these clips to generate responses. This approach mitigates the "noise and redundancy challenge" and "memory and computation" constraints of processing entire long videos. MiniGPT4-Video supports this by generating detailed descriptions for video clips, enhancing the retrieval process.
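The retrieve-then-answer loop described above can be sketched as follows. This is a minimal illustration only: the stand-in embedding function, the `top_k` value, and all names here are assumptions for exposition, not the repository's actual API (Goldfish uses real text embeddings, e.g. OpenAI embeddings, over the clip descriptions generated by MiniGPT4-Video).

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding for illustration; Goldfish would embed the
    # clip descriptions with a real text-embedding model instead.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def retrieve_clips(question: str, clip_descriptions: list[str], top_k: int = 3) -> list[int]:
    """Rank clips by cosine similarity between the question and each
    clip's generated description, and keep the top_k indices."""
    q = embed(question)
    scored = [(float(embed(d) @ q), i) for i, d in enumerate(clip_descriptions)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]

# Only the retrieved clips are passed on to the video-language model,
# so the full long video is never processed at once.
descriptions = [f"clip {i}: description of clip {i}" for i in range(100)]
selected = retrieve_clips("What happens at the end?", descriptions)
```

Restricting generation to a handful of retrieved clips is what sidesteps the noise/redundancy and memory constraints the section mentions: the context handed to the LLM stays bounded regardless of video length.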

Quick Start & Requirements

  • Install: Clone the repository and set up the environment using conda env create -f environment.yml.
  • Prerequisites: Requires checkpoints for MiniGPT4-Video (Llama2 Chat 7B or Mistral 7B). Optional: OpenAI API key for enhanced embedding performance.
  • Demo/Inference: Scripts are provided for running demos and inference with both Goldfish and MiniGPT4-Video, supporting the Llama2 and Mistral backends.
  • Links: Project Page, arXiv Paper

Highlighted Details

  • Goldfish achieves 41.78% accuracy on the TVQA-long benchmark, surpassing prior methods by 14.94%.
  • MiniGPT4-Video outperforms state-of-the-art on short video benchmarks (MSVD, MSRVTT, TGIF, TVQA) by up to 23.59%.
  • Supports both Llama2 and Mistral LLM backends.
  • Includes comprehensive evaluation scripts and benchmark results for both short and long video understanding tasks.

Maintenance & Community

  • The project is associated with ECCV 2024 (Goldfish) and a CVPR 2024 workshop (MiniGPT4-Video).
  • Citation details are provided in BibTeX format.
  • Acknowledgements mention MiniGPT4 and Video-ChatGPT.

Licensing & Compatibility

  • License: BSD 3-Clause License.
  • Compatibility: Based on MiniGPT4. Commercial use is generally permitted under BSD 3-Clause, but users should verify specific dependencies.

Limitations & Caveats

  • The README mentions that some subtitles for evaluation datasets (MSR-VTT, ActivityNet) are generated using Whisper, which may impact performance consistency.
  • For optimal performance, using OpenAI embeddings is recommended, requiring an API key.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 3 stars in the last 30 days

Explore Similar Projects

Starred by Matei Zaharia (Cofounder of Databricks), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel

Multimodal autoregressive model for long-context video/text

Created 1 year ago
Updated 11 months ago
7k stars
Top 0.1% on SourcePulse