Video-MME by MME-Benchmarks

Evaluation benchmark for multimodal LLMs in video analysis

created 1 year ago
611 stars

Top 54.5% on sourcepulse

Project Summary

Video-MME is a comprehensive benchmark designed to evaluate the capabilities of Multimodal Large Language Models (MLLMs) in video analysis. It addresses the limited exploration of MLLMs' ability to process sequential visual data by offering a full-spectrum evaluation across diverse video durations, types, and modalities. The benchmark is valuable for researchers and developers who want to advance and assess MLLMs on video-understanding tasks.

How It Works

Video-MME comprises 900 videos totaling 254 hours, with 2,700 human-annotated question-answer pairs. It distinguishes itself through its temporal dimension (short, medium, long videos), diversity in video types (6 primary domains, 30 subfields), breadth in data modalities (video frames, subtitles, audio), and high-quality, novel annotations. The evaluation pipeline involves extracting frames and subtitles, using a standardized prompt format, and then evaluating model responses against ground truth using provided scripts.
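As an illustration of that pipeline, here is a minimal sketch in Python of uniform frame sampling and multiple-choice prompt assembly. The function names, the eight-frame budget, and the prompt wording are assumptions for the example, not the repository's exact scripts (requires opencv-python):

```python
# Illustrative sketch only: uniform frame sampling plus a multiple-choice
# prompt template. The exact prompt wording and frame budget used by the
# official Video-MME scripts may differ.
import cv2

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample frames across the whole video with OpenCV."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def build_prompt(question: str, options: list[str], subtitles: str = "") -> str:
    """Assemble a standardized multiple-choice prompt (wording is illustrative)."""
    parts = []
    if subtitles:
        parts.append(f"This video's subtitles are listed below:\n{subtitles}")
    parts.append("Select the best answer to the following multiple-choice "
                 "question based on the video.")
    parts.append(question)
    parts.extend(options)
    parts.append("Respond with only the letter of the correct option.")
    return "\n".join(parts)
```

Sampling frames uniformly rather than decoding every frame keeps the 254-hour corpus tractable; models that accept more frames can simply raise the budget.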

Quick Start & Requirements

  • Installation: The benchmark itself has no installation step; Python evaluation scripts are provided in the repository (a scoring sketch follows this list).
  • Prerequisites: Access to the dataset (900 videos, 744 subtitles) is required. Evaluation scripts are in Python.
  • Resources: Processing 254 hours of video data will require significant storage and computational resources for frame extraction and model inference.
  • Links: Project Page, arXiv Paper, Dataset, MME-Survey, Leaderboard
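For the scoring step, a minimal sketch of accuracy computation against the ground-truth answers might look like the following. The JSON field names and file layout are assumptions for illustration, not the repository's actual schema:

```python
# Hypothetical scoring sketch: compares predicted option letters against
# ground truth. Field names ("question_id", "answer") and file paths are
# illustrative assumptions, not the official evaluation script.
import json

def accuracy(pred_path: str, gt_path: str) -> float:
    """Fraction of questions whose predicted letter matches the ground truth."""
    with open(pred_path) as f:
        preds = {p["question_id"]: p["answer"].strip().upper()
                 for p in json.load(f)}
    with open(gt_path) as f:
        gts = {g["question_id"]: g["answer"].strip().upper()
               for g in json.load(f)}
    correct = sum(1 for qid, ans in gts.items() if preds.get(qid) == ans)
    return correct / max(len(gts), 1)

print(f"Accuracy: {accuracy('predictions.json', 'ground_truth.json'):.2%}")
```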

Highlighted Details

  • First-ever comprehensive evaluation benchmark for MLLMs in video analysis, accepted by CVPR 2025.
  • Covers video durations from 11 seconds to 1 hour, spanning short, medium, and long-term contexts.
  • Includes multi-modal inputs like subtitles and audio alongside video frames.
  • Used as an "industry standard measure" of long-context ability in OpenAI's GPT-4o evaluations.

Maintenance & Community

The project is associated with the MME-Survey and MMBench teams. The primary contact for issues and leaderboard submissions is videomme2024@gmail.com.

Licensing & Compatibility

Video-MME is strictly for academic research use; commercial use is prohibited. Copyright of videos belongs to their respective owners. Distribution, publication, copying, dissemination, or modification of the benchmark without prior approval is forbidden.

Limitations & Caveats

The dataset is restricted to academic research; commercial use is prohibited, and users must comply with strict distribution and modification restrictions. Video content belongs to external owners, and the project maintains a process for addressing copyright-infringement claims.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History: 75 stars in the last 90 days
