Evaluation benchmark for multimodal LLMs in video analysis
Video-MME is a comprehensive benchmark designed to evaluate the capabilities of Multimodal Large Language Models (MLLMs) in video analysis. It addresses the limited exploration of MLLMs' ability to process sequential visual data by offering a full-spectrum evaluation across diverse video durations, types, and modalities. The benchmark is valuable for researchers and developers who want to assess and advance MLLMs on video understanding tasks.
How It Works
Video-MME comprises 900 videos totaling 254 hours, with 2,700 human-annotated question-answer pairs. It distinguishes itself along four axes: temporal coverage (short, medium, and long videos), diversity of video types (6 primary domains, 30 subfields), breadth of data modalities (video frames, subtitles, and audio), and high-quality, newly created annotations. The evaluation pipeline involves extracting frames and subtitles, querying the model with a standardized prompt format, and then scoring model responses against ground truth using the provided scripts.
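The repository ships its own evaluation scripts; the snippet below is only a minimal sketch of what such a pipeline might look like, assuming uniform frame sampling with OpenCV and single-letter multiple-choice answers. The function names, prompt wording, and sampling strategy are illustrative assumptions, not the benchmark's actual code.

```python
# Illustrative sketch of a Video-MME-style evaluation loop.
# Assumptions (not taken from the benchmark's own scripts): uniform frame
# sampling via OpenCV, an example prompt template, and multiple-choice
# answers given as single letters. The model call itself is omitted.
import cv2


def sample_frames(video_path: str, num_frames: int = 16):
    """Uniformly sample `num_frames` frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # BGR numpy array
    cap.release()
    return frames


def build_prompt(question: str, options: list[str], subtitles: str | None = None) -> str:
    """Hypothetical multiple-choice prompt; the benchmark defines its own template."""
    parts = []
    if subtitles:
        parts.append(f"Subtitles:\n{subtitles}")
    parts.append(question)
    parts.extend(options)  # e.g. ["A. ...", "B. ...", "C. ...", "D. ..."]
    parts.append("Answer with the option's letter only.")
    return "\n".join(parts)


def score(predictions: list[str], ground_truth: list[str]) -> float:
    """Exact-match accuracy over answer letters."""
    correct = sum(
        p.strip().upper()[:1] == g.strip().upper()[:1]
        for p, g in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)
```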
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project is associated with the MME-Survey and MMBench teams. The primary contact for issues and leaderboard submissions is videomme2024@gmail.com.
Licensing & Compatibility
Video-MME is strictly for academic research use; commercial use is prohibited. Copyright of videos belongs to their respective owners. Distribution, publication, copying, dissemination, or modification of the benchmark without prior approval is forbidden.
Limitations & Caveats
The dataset is restricted to academic research, prohibiting commercial use, and users must comply with strict distribution and modification restrictions. Video content belongs to external owners; the maintainers provide a process for addressing copyright infringement claims.