Video-MME-v2 by MME-Benchmarks

Next-generation benchmark for video multimodal AI

Created 1 week ago

350 stars

Top 79.5% on SourcePulse

View on GitHub
Project Summary

Video-MME-v2 is a next-generation benchmark designed to address two problems: the saturation of existing video-understanding evaluation sets, and the gap between leaderboard performance and real-world user experience. It targets researchers and developers of video multimodal large language models (V-MLLMs), providing a more robust, progressive evaluation paradigm intended to drive higher-quality technical iteration in the field. Rather than relying on simple accuracy alone, it aims to assess V-MLLM capabilities more faithfully.

How It Works

Video-MME-v2 introduces three key innovations: a progressive, three-level evaluation framework (information aggregation, temporal understanding, complex reasoning); a grouped, non-linear scoring mechanism that assesses capability consistency and reasoning coherence across interrelated questions; and rigorous data annotation involving over 3,300 human-hours from 60+ experts. This approach moves beyond single-question accuracy to evaluate a model's robustness and deeper understanding of temporal dynamics and world knowledge, offering a more nuanced assessment of V-MLLM performance.
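The grouped, non-linear scoring idea can be sketched as follows. This is a minimal illustration, not the benchmark's exact formula: the all-or-nothing group rule and the record fields (`group_id`, `correct`) are assumptions made here for clarity.

```python
from collections import defaultdict

def average_accuracy(results):
    """Standard per-question accuracy: fraction of correct answers."""
    return sum(r["correct"] for r in results) / len(results)

def grouped_score(results):
    """Illustrative non-linear group score: a group of interrelated
    questions earns credit only if *every* question in it is answered
    correctly (an assumed rule, not the paper's exact mechanism)."""
    groups = defaultdict(list)
    for r in results:
        groups[r["group_id"]].append(r["correct"])
    return sum(all(v) for v in groups.values()) / len(groups)

# Toy example: two groups of two questions; three of four answered correctly.
results = [
    {"group_id": "vid1", "correct": True},
    {"group_id": "vid1", "correct": True},
    {"group_id": "vid2", "correct": True},
    {"group_id": "vid2", "correct": False},
]
print(average_accuracy(results))  # 0.75
print(grouped_score(results))     # 0.5
```

Under this kind of rule, a model that is right on average but inconsistent within question groups scores much lower than its plain accuracy suggests, which is exactly the robustness gap the benchmark is designed to expose.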

Quick Start & Requirements

Evaluation can be performed using VLMEvalKit or a standalone script with HuggingFace Transformers.

  • VLMEvalKit:
    • Install: git clone https://github.com/open-compass/VLMEvalKit.git && cd VLMEvalKit && pip install -e .
    • Run: python run.py --model <model_name> --data Video-MME-v2_<config> (configurations include frame sampling rates, subtitle usage, and reasoning modes).
  • Standalone Transformers:
    • Dependencies: pip install torch transformers accelerate decord pandas numpy pillow tqdm
    • Run: python evaluation/test_video_mme_v2.py --model <model_name> --parquet <path_to_test.parquet> --video-dir <path_to_videos> [other args]
  • Dataset: Hosted on Hugging Face (MME-Benchmarks/Video-MME-v2), containing 800 videos, subtitle files, and 3,200 question-answer pairs.
  • Links: Project Page, Paper, Dataset, Leaderboard.
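Given the stated dataset shape (800 videos and 3,200 question-answer pairs, i.e. four questions per video), iterating the QA table per video might look like the sketch below. The column names (`video_id`, `question`, `answer`) are assumptions about the parquet schema, and the DataFrame here is a synthetic stand-in for `test.parquet`.

```python
import pandas as pd

# Synthetic stand-in for test.parquet; real column names may differ.
qa = pd.DataFrame({
    "video_id": ["v001"] * 4 + ["v002"] * 4,
    "question": [f"Q{i}" for i in range(8)],
    "answer":   list("ABCDABCD"),
})

# Group the interrelated questions per video, mirroring how the grouped
# scoring mechanism evaluates them together rather than independently.
for video_id, group in qa.groupby("video_id"):
    print(video_id, len(group), "questions")
```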

Highlighted Details

  • Non-linear scoring reveals significant gaps between average accuracy and group-based scores for SOTA models, indicating robustness issues.
  • Enabling "Thinking Mode" with subtitles generally improves performance, but can cause regressions without subtitles, highlighting areas for improvement in current reasoning mechanisms.
  • Models show varying strengths across dimensions; Gemini-3-Pro demonstrates strong performance in audio integration and long-horizon temporal reasoning, while areas like action semantics and physical world reasoning remain challenging for current models.

Maintenance & Community

The project is actively maintained; recent news entries indicate ongoing development focused on driving the next generation of video understanding models. No dedicated community channels (e.g., Discord or Slack) are listed, but the project links to a leaderboard for model submissions and a project page with further analysis.

Licensing & Compatibility

Video-MME-v2 is strictly for academic research purposes; commercial use in any form is prohibited. Distribution, publication, copying, dissemination, or modification of the dataset without prior approval is forbidden. The copyright of all videos belongs to their respective owners.

Limitations & Caveats

The current "Thinking Mode" in V-MLLMs can lead to performance regressions, particularly when subtitles are not used. Even state-of-the-art models exhibit significant room for improvement across various video understanding dimensions, especially in fine-grained action semantics and physical world reasoning. The dataset is restricted to non-commercial use, limiting its applicability for commercial product development.

Health Check

  • Last Commit: 23 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 375 stars in the last 8 days
