JUNJIE99
Comprehensive benchmark for long video understanding
Summary
MLVU (Multi-task Long Video Understanding Benchmark) addresses the need for standardized evaluation of Multimodal Large Language Models (MLLMs) on long-form video comprehension. It provides a benchmark of diverse long videos, ranging from 3 minutes to 2 hours, together with nine distinct tasks. Researchers can use it to assess MLLMs' ability to understand complex visual narratives, to expose the current limitations of state-of-the-art models, and to guide future development in this domain.
How It Works
MLVU is built from a wide variety of long videos, from 3 minutes to 2 hours in length, and defines nine evaluation tasks. The tasks challenge MLLMs across three categories: holistic understanding, single-detail comprehension, and multi-detail analysis. They include both multiple-choice questions (MLVU M) and free-form generation tasks. The benchmark supplies annotation files and links to the raw videos on Hugging Face; the videos are pre-processed (resolution reduction, clipping) to respect copyrights. Evaluation is integrated with the lmms-eval framework, which supports assessment of the multiple-choice questions, and leaderboards are maintained for both the development and test sets.
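As a rough illustration of how multiple-choice results across several tasks might be scored, the sketch below computes per-task accuracy and an unweighted mean over tasks. It is not the project's or lmms-eval's scoring code: the record layout (`task`, `answer`, `prediction`), the task labels, and the function name `score_multiple_choice` are all assumptions made for this example, and the real annotation schema may differ.

```python
from collections import defaultdict


def score_multiple_choice(records):
    """Return (per-task accuracy dict, unweighted mean accuracy over tasks).

    Each record is assumed (hypothetically) to be a dict with the keys
    "task", "answer" and "prediction", where the last two are option letters.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    # Averaging over tasks (not over questions) weights every task equally.
    mean_accuracy = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return per_task, mean_accuracy


# Placeholder data; the task labels are illustrative, not MLVU's task names.
sample = [
    {"task": "holistic", "answer": "A", "prediction": "a"},
    {"task": "holistic", "answer": "B", "prediction": "C"},
    {"task": "single_detail", "answer": "D", "prediction": "D "},
    {"task": "multi_detail", "answer": "A", "prediction": "B"},
]

per_task, mean_accuracy = score_multiple_choice(sample)
print(per_task)
print(mean_accuracy)
```

Averaging per-task accuracies rather than pooling all questions keeps a task with many questions from dominating the overall score; whether MLVU's leaderboards aggregate this way is not stated here.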
Quick Start & Requirements
MLVU has no installation command of its own; evaluation relies on external tools. Annotation files are available, and the raw video data can be accessed via a Hugging Face link. Integration with lmms-eval is the recommended way to evaluate the multiple-choice questions. No hardware or software prerequisites beyond a standard MLLM development environment are documented. Users must agree to the dataset's license terms before accessing the data.
Maintenance & Community
MLVU has been migrated to a new repository for improved maintenance and updates, and the project team encourages users to raise issues for support and collaboration. The provided documentation does not mention community channels such as Discord or Slack.
Licensing & Compatibility
The MLVU dataset is licensed under CC-BY-NC-SA-4.0, and its use is restricted to research purposes only; commercial and other non-research applications are prohibited. Users must agree to these terms before accessing the dataset, and they assume all responsibility for any use beyond research.
Limitations & Caveats
The primary limitation is the non-commercial clause of the CC-BY-NC-SA-4.0 license, which makes the dataset unsuitable for commercial applications. In addition, the project does not own the copyright to the raw video files; they are provided under specific conditions and are subject to removal requests, which may affect dataset availability. The benchmark covers visual content only and excludes audio analysis.