MME-Benchmarks: Next-generation benchmark for video multimodal AI
Video-MME-v2 is a next-generation benchmark designed to address the saturation of existing video understanding evaluation sets and the gap between leaderboard performance and real-world user experience. It targets researchers and developers of video multimodal large language models (V-MLLMs), providing a more robust and progressive evaluation paradigm to drive higher-quality technical iteration in the field. The benchmark offers a more accurate assessment of V-MLLM capabilities beyond simple accuracy metrics.
How It Works
Video-MME-v2 introduces three key innovations: a progressive, three-level evaluation framework (information aggregation, temporal understanding, complex reasoning); a grouped, non-linear scoring mechanism that assesses capability consistency and reasoning coherence across interrelated questions; and rigorous data annotation involving over 3,300 human-hours from 60+ experts. This approach moves beyond single-question accuracy to evaluate a model's robustness and deeper understanding of temporal dynamics and world knowledge, offering a more nuanced assessment of V-MLLM performance.
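The grouped, non-linear scoring idea described above can be illustrated with a toy scorer (a sketch only; the function name, grouping scheme, and all-or-nothing credit rule are assumptions, not the benchmark's actual implementation):

```python
from typing import Dict, List

def grouped_score(group_results: Dict[str, List[bool]]) -> float:
    """Toy grouped, non-linear scorer (illustrative only).

    Each group holds per-question correctness for a set of interrelated
    questions about the same video (e.g., aggregation -> temporal ->
    reasoning). Credit is non-linear: a group earns credit only if every
    question in it is answered correctly, so inconsistent answers within
    a group are penalized more heavily than flat accuracy would be.
    """
    if not group_results:
        return 0.0
    credits = [1.0 if all(answers) else 0.0 for answers in group_results.values()]
    return sum(credits) / len(credits)

results = {
    "clip_A": [True, True, True],   # consistent group -> full credit
    "clip_B": [True, True, False],  # one slip -> no credit for the group
}
print(grouped_score(results))  # 0.5, versus 5/6 flat accuracy
```

Under this kind of rule, a model answering 5 of 6 questions correctly still scores 0.5, which is how grouped scoring surfaces consistency rather than raw per-question accuracy.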
Quick Start & Requirements
Evaluation can be performed using VLMEvalKit or a standalone script with HuggingFace Transformers.
Via VLMEvalKit:

```shell
git clone https://github.com/open-compass/VLMEvalKit.git && cd VLMEvalKit && pip install -e .
python run.py --model <model_name> --data Video-MME-v2_<config>
```

Configurations include frame sampling rates, subtitle usage, and reasoning modes.

Standalone with HuggingFace Transformers:

```shell
pip install torch transformers accelerate decord pandas numpy pillow tqdm
python evaluation/test_video_mme_v2.py --model <model_name> --parquet <path_to_test.parquet> --video-dir <path_to_videos> [other args]
```
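The frame-sampling configurations toggled by the commands above can be sketched with a minimal uniform-sampling helper (hypothetical code, not part of the benchmark's tooling; the function name and centering strategy are assumptions):

```python
from typing import List

def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick `num_samples` frame indices spread evenly across a video.

    Each index sits at the center of one of `num_samples` equal-width
    segments, the common way fixed-rate frame sampling is implemented
    for video LLM evaluation.
    """
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]

print(uniform_frame_indices(100, 4))  # [12, 37, 62, 87]
```

Evaluation harnesses typically pair indices like these with a decoder such as decord to extract the actual frames before passing them to the model.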
Maintenance & Community
The project is actively maintained, with recent news indicating ongoing development and a focus on driving the next generation of video understanding models. While specific community channels like Discord/Slack are not detailed, the project provides links to a leaderboard for model submission and a project page for further analysis and interaction.
Licensing & Compatibility
Video-MME-v2 is strictly for academic research purposes; commercial use in any form is prohibited. Distribution, publication, copying, dissemination, or modification of the dataset without prior approval is forbidden. The copyright of all videos belongs to their respective owners.
Limitations & Caveats
The current "Thinking Mode" in V-MLLMs can lead to performance regressions, particularly when subtitles are not used. Even state-of-the-art models exhibit significant room for improvement across various video understanding dimensions, especially in fine-grained action semantics and physical world reasoning. The dataset is restricted to non-commercial use, limiting its applicability for commercial product development.