MMBench by open-compass

Benchmark for evaluating the multimodal capabilities of large vision-language models (LVLMs)

Created 2 years ago
263 stars

Top 97.0% on SourcePulse

Project Summary

MMBench provides a robust, fine-grained evaluation framework for large vision-language models (LVLMs), addressing limitations of traditional benchmarks. It targets researchers and developers needing to comprehensively assess LVLM capabilities across diverse skills, offering objective and scalable insights into model performance.

How It Works

MMBench features approximately 3,000 multiple-choice questions across 20 fine-grained ability dimensions, structured hierarchically (L-1 to L-3) to cover perception and reasoning skills. Evaluation employs a Circular Evaluation strategy: inference runs once per circular shift of the answer choices, and a question is credited only if every pass is answered correctly, which makes scoring more robust. LLM-based choice extractors, primarily ChatGPT, parse free-form VLM outputs into specific multiple-choice answers (A, B, C, D), ensuring objective scoring.
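
As a rough illustration of the Circular Evaluation idea (not the official VLMEvalKit implementation), the sketch below shifts the answer options once per pass and credits a question only when every rotation is answered correctly; circular_eval and the ask_model callback are hypothetical names introduced here.

    # Minimal sketch of Circular Evaluation: each question is asked once per
    # circular shift of its choices, and it counts as correct only if every
    # rotation is answered correctly.
    from collections import deque

    def circular_eval(question, choices, answer_idx, ask_model):
        """choices: list of option texts; answer_idx: index of the correct one.
        ask_model(question, options) is a placeholder returning a letter like 'A'."""
        letters = "ABCD"[:len(choices)]
        rotated = deque(choices)
        correct_text = choices[answer_idx]
        for _ in range(len(choices)):
            options = dict(zip(letters, rotated))
            pred = ask_model(question, options)       # e.g. 'B'
            if options.get(pred) != correct_text:     # any miss fails the whole item
                return False
            rotated.rotate(1)                         # shift choices for the next pass
        return True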

Quick Start & Requirements

Evaluation is performed with the official VLMEvalKit toolkit: install VLMEvalKit, then use its Python scripts for data loading and visualization. Model inference is launched with python run.py --model <your_model_name> --data MMBench_TEST_EN --mode infer, and the resulting Excel file can be submitted to the MMBench leaderboard. Specific hardware requirements aren't documented, but VLM evaluation typically needs GPU resources.
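
For a quick local look at the benchmark data, a minimal sketch along these lines may help; the file name MMBench_TEST_EN.tsv and the column names (question, A-D, image) are assumptions about the TSV layout commonly used with VLMEvalKit, not guaranteed by the README.

    # Minimal sketch of loading and visualizing one MMBench item, assuming a
    # tab-separated file with base64-encoded images and A-D option columns;
    # the path and column names are assumptions.
    import base64
    import io

    import pandas as pd
    from PIL import Image

    df = pd.read_csv("MMBench_TEST_EN.tsv", sep="\t")        # illustrative path
    row = df.iloc[0]
    print(row["question"], {c: row[c] for c in "ABCD" if c in row})

    # Images are stored as base64 strings; decode one for a quick look.
    image = Image.open(io.BytesIO(base64.b64decode(row["image"])))
    image.show()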

Highlighted Details

  • Evaluates 20 fine-grained ability dimensions across hierarchical levels.
  • Utilizes multiple-choice format and Circular Evaluation for robust, objective scoring.
  • LLM-based extractors map free-form VLM outputs to the given answer choices (see the sketch after this list).
  • Supports both English and Chinese language versions.
  • Includes CCBench for Chinese culture-related assessments.
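
As a rough sketch of the extraction step referenced above, a rule-based first pass might look like the following, with unresolved outputs handed to the LLM-based extractor (ChatGPT in the benchmark); extract_choice and its heuristics are illustrative assumptions, not the official extraction logic.

    # Illustrative rule-based first pass for choice extraction; anything it
    # cannot resolve would be deferred to the LLM-based extractor.
    import re

    def extract_choice(free_form_answer, options):
        """options: dict like {'A': 'a dog', 'B': 'a cat', ...}.
        Returns a matched letter, or None to defer to the LLM extractor."""
        text = free_form_answer.strip()
        # Case 1: the answer starts with a bare letter, e.g. "B", "B.", "(C) ...".
        match = re.match(r"^\(?([A-D])\)?[\s.,:)]", text + " ")
        if match:
            return match.group(1)
        # Case 2: the option text itself appears in the response.
        for letter, option_text in options.items():
            if option_text.lower() in text.lower():
                return letter
        return None  # ambiguous -> use the LLM-based extractor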

Maintenance & Community

Developed by the OpenCompass Community, MMBench benefits from active community involvement. Recent updates include integration with VLMEvalKit, CCBench enhancements, and Chinese-language support. A submission system facilitates leaderboard participation.

Licensing & Compatibility

The provided README does not specify the software license. Users should verify licensing terms for integration, particularly for commercial use.

Limitations & Caveats

The evaluation depends on LLM-based choice extractors for parsing VLM outputs, introducing a dependency on external LLM performance. The Circular Evaluation strategy is more demanding than traditional methods, potentially causing significant accuracy drops (10-20%) for existing VLMs.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 10 stars in the last 30 days
