This repository provides implementations of two video understanding models: MiniGPT4-Video for short videos and Goldfish for arbitrarily long videos. Goldfish addresses the difficulty of processing lengthy video content by using a retrieval mechanism over clip-level descriptions, serving both research and practical multimodal applications.
How It Works
Goldfish tackles long video understanding by first retrieving relevant video clips using an efficient mechanism, then processing these clips to generate responses. This approach mitigates the "noise and redundancy challenge" and "memory and computation" constraints of processing entire long videos. MiniGPT4-Video supports this by generating detailed descriptions for video clips, enhancing the retrieval process.
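As a rough picture of that retrieve-then-answer pipeline, the sketch below embeds clip descriptions and the question, then keeps only the best-matching clips. This is an illustrative example, not the repository's code: the sentence-transformers model stands in for the OpenAI or open-source embedders the project supports, and `retrieve_top_k` plus the sample descriptions are hypothetical.

```python
# Illustrative sketch of Goldfish-style retrieval over a long video.
# Assumes clip descriptions have already been generated (e.g. by MiniGPT4-Video).
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_top_k(question, clip_descriptions, k=3):
    """Return indices of the k clip descriptions most similar to the question."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in text embedder
    clip_emb = model.encode(clip_descriptions, normalize_embeddings=True)
    q_emb = model.encode([question], normalize_embeddings=True)[0]
    scores = clip_emb @ q_emb  # cosine similarity, since embeddings are normalized
    return np.argsort(-scores)[:k].tolist()

descriptions = [
    "A chef chops vegetables and explains the recipe.",
    "Two characters argue in a courtroom.",
    "A dog chases a ball across a park.",
]
# Only the retrieved clips (with their descriptions/subtitles) are passed to the
# video-language model, so memory and compute stay bounded regardless of length.
print(retrieve_top_k("What happens during the courtroom scene?", descriptions, k=1))
```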
Quick Start & Requirements
- Install: Clone the repository and set up the environment with `conda env create -f environment.yml`.
- Prerequisites: Checkpoints for MiniGPT4-Video (Llama2 Chat 7B or Mistral 7B backend). Optional: an OpenAI API key for stronger embedding-based retrieval (a key-setup sketch follows this list).
- Demo/Inference: Scripts are provided for running demos and inference with both Goldfish and MiniGPT4-Video, using either the Llama2 or Mistral backend.
- Links: Project Page, arXiv Paper
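If you opt into the optional OpenAI embeddings, the key is commonly supplied through the standard `OPENAI_API_KEY` environment variable. The snippet below is a minimal sketch under that assumption; the repository's own scripts may expect a different mechanism (flag or config file), so check its docs.

```python
# Assumption: the OpenAI client reads the key from the standard OPENAI_API_KEY
# environment variable. Verify the exact mechanism the repo's scripts expect.
import os

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # placeholder key
```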
Highlighted Details
- Goldfish achieves 41.78% accuracy on the TVQA-long benchmark, surpassing prior methods by 14.94%.
- MiniGPT4-Video outperforms prior state-of-the-art methods on short-video benchmarks (MSVD, MSRVTT, TGIF, TVQA) by up to 23.59%.
- Supports both Llama2 and Mistral LLM backends.
- Includes comprehensive evaluation scripts and benchmark results for both short and long video understanding tasks.
Maintenance & Community
- The project is associated with ECCV 2024 and a CVPR 2024 workshop.
- Citation details are provided in BibTeX format.
- Acknowledgements mention MiniGPT4 and Video-ChatGPT.
Licensing & Compatibility
- License: BSD 3-Clause License.
- Compatibility: Based on MiniGPT4. Commercial use is generally permitted under the BSD 3-Clause license, but users should verify the licenses of individual dependencies and model weights (e.g., the Llama2 and Mistral checkpoints).
Limitations & Caveats
- The README mentions that some subtitles for evaluation datasets (MSR-VTT, ActivityNet) are generated using Whisper, which may impact performance consistency.
- For best retrieval quality, OpenAI embeddings are recommended; this requires an API key.