MLVU by JUNJIE99

Comprehensive benchmark for long video understanding

Created 2 years ago
256 stars

Top 98.5% on SourcePulse

Summary

MLVU (Multi-task Long Video Understanding Benchmark) provides a standardized way to evaluate Multimodal Large Language Models (MLLMs) on long-form video comprehension. It combines diverse long videos, ranging from 3 minutes to 2 hours, with nine distinct tasks. Researchers can use it to measure how well MLLMs follow complex visual narratives and to identify where current state-of-the-art models fall short.

How It Works

MLVU is built from a wide variety of long videos, 3 minutes to 2 hours in length, and defines nine evaluation tasks. The tasks fall into three categories: holistic understanding, single-detail comprehension, and multi-detail analysis. They include both multiple-choice questions (MLVU M) and free-form generation tasks. The benchmark ships annotation files and links to the raw videos on Hugging Face; the videos are pre-processed (reduced resolution, clipping) to respect copyrights. Multiple-choice evaluation is integrated with the lmms-eval framework, and the project maintains leaderboards for both the development and test sets.
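As an illustration of how multiple-choice results on a multi-task benchmark can be scored, the sketch below computes accuracy per task and then an unweighted mean across tasks. It is a minimal sketch, not the project's evaluation code: the record keys (`task`, `prediction`, `answer`) and the task labels are hypothetical, and this summary does not state which aggregation the MLVU leaderboards use.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for multiple-choice records.

    Each record is a dict with the hypothetical keys 'task',
    'prediction', and 'answer' (option letters such as 'A').
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


def macro_average(accuracies):
    """Unweighted mean over tasks, so each task counts equally."""
    return sum(accuracies.values()) / len(accuracies)


# Toy records with placeholder task labels (not real MLVU task names).
records = [
    {"task": "task_a", "prediction": "A", "answer": "A"},
    {"task": "task_a", "prediction": "b", "answer": "C"},
    {"task": "task_b", "prediction": " d ", "answer": "D"},
]

scores = per_task_accuracy(records)
macro = macro_average(scores)
print(scores, macro)
```

Averaging per task rather than per question keeps a task with many questions from dominating the overall score; a per-question average over the same toy records would be 2/3 instead of 0.75.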

Quick Start & Requirements

MLVU has no installation command of its own; it relies on external tools for evaluation. Annotation files are provided, and the raw videos are available through a Hugging Face link. The project recommends lmms-eval for evaluating the multiple-choice questions. No hardware or software prerequisites beyond a standard MLLM development environment are documented. Users must agree to the dataset's license terms before accessing the data.

Highlighted Details

  • Presents the first comprehensive benchmark built specifically for multi-task long video understanding (LVU), designed to push the boundaries of MLLM capabilities.
  • Features nine distinct tasks categorized into holistic understanding, single-detail comprehension, and multi-detail analysis, offering a broad evaluation spectrum.
  • Evaluations of 20 popular MLLMs show that LVU remains difficult: even a top performer such as GPT-4o achieves only 64.6% on the multiple-choice tasks. The results point to a need for improvements in context length, image understanding, and LLM backbones.
  • Maintains leaderboards for both MLVU-Dev and MLVU-Test sets, tracking performance across numerous models and providing a competitive landscape.
  • The MLVU-Test set includes newly added, more challenging tasks such as Sports Question Answering (SQA) and Tutorial Question Answering (TQA), expanding the benchmark's scope.

Maintenance & Community

MLVU has been migrated to a new repository for improved maintenance and updates, with the project team encouraging users to raise issues for support and collaboration. Specific community channels like Discord or Slack are not mentioned in the provided documentation.

Licensing & Compatibility

The MLVU dataset is licensed under CC-BY-NC-SA-4.0. Use is restricted to research purposes; commercial and other non-research applications are prohibited. Users must agree to these terms before accessing the dataset, and they assume all responsibility for any use beyond research.

Limitations & Caveats

The primary limitation is the non-commercial clause of the CC-BY-NC-SA-4.0 license, which rules out commercial applications. In addition, the project does not own the copyright to the raw video files; they are provided under specific conditions and are subject to removal requests, which could affect dataset availability. The benchmark covers visual content only and does not evaluate audio.

Health Check
Last Commit: 1 month ago
Responsiveness: 1 day
Pull Requests (30d): 0
Issues (30d): 0
Star History: 7 stars in the last 30 days
