MMBench by open-compass

Benchmark for evaluating the multimodal capabilities of large vision-language models (LVLMs)

Created 2 years ago
263 stars

Top 97.0% on SourcePulse

Project Summary

MMBench provides a robust, fine-grained evaluation framework for large vision-language models (LVLMs), addressing limitations of traditional benchmarks. It targets researchers and developers needing to comprehensively assess LVLM capabilities across diverse skills, offering objective and scalable insights into model performance.

How It Works

MMBench features approximately 3,000 multiple-choice questions across 20 fine-grained ability dimensions, structured hierarchically (L-1 to L-3) to cover perception and reasoning skills. Evaluation employs a Circular Evaluation strategy: inference runs once per circular shift of the answer choices, and a question is credited only if every pass is answered correctly, which makes scoring more robust. LLM-based choice extractors, primarily ChatGPT, parse free-form VLM outputs into specific multiple-choice answers (A, B, C, D), ensuring objective scoring.
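
As a rough illustration of the Circular Evaluation idea (not the official VLMEvalKit implementation), the sketch below shifts the answer options once per pass and credits a question only when every rotation is answered correctly; circular_eval and the ask_model callback are hypothetical names introduced here.

    # Minimal sketch of Circular Evaluation: each question is asked once per
    # circular shift of its choices, and it counts as correct only if every
    # rotation is answered correctly.
    from collections import deque

    def circular_eval(question, choices, answer_idx, ask_model):
        """choices: list of option texts; answer_idx: index of the correct one.
        ask_model(question, options) is a placeholder returning a letter like 'A'."""
        letters = "ABCD"[:len(choices)]
        rotated = deque(choices)
        correct_text = choices[answer_idx]
        for _ in range(len(choices)):
            options = dict(zip(letters, rotated))
            pred = ask_model(question, options)       # e.g. 'B'
            if options.get(pred) != correct_text:     # any miss fails the whole item
                return False
            rotated.rotate(1)                         # shift choices for the next pass
        return True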

Quick Start & Requirements

Evaluation is performed with the official VLMEvalKit toolkit: install VLMEvalKit, then use its Python scripts for data loading and visualization. Model inference is launched with python run.py --model <your_model_name> --data MMBench_TEST_EN --mode infer, and the resulting Excel file can be submitted to the MMBench leaderboard. Specific hardware requirements aren't documented, but VLM evaluation typically needs GPU resources.
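
For a quick local look at the benchmark data, a minimal sketch along these lines may help; the file name MMBench_TEST_EN.tsv and the column names (question, A-D, image) are assumptions about the TSV layout commonly used with VLMEvalKit, not guaranteed by the README.

    # Minimal sketch of loading and visualizing one MMBench item, assuming a
    # tab-separated file with base64-encoded images and A-D option columns;
    # the path and column names are assumptions.
    import base64
    import io

    import pandas as pd
    from PIL import Image

    df = pd.read_csv("MMBench_TEST_EN.tsv", sep="\t")        # illustrative path
    row = df.iloc[0]
    print(row["question"], {c: row[c] for c in "ABCD" if c in row})

    # Images are stored as base64 strings; decode one for a quick look.
    image = Image.open(io.BytesIO(base64.b64decode(row["image"])))
    image.show()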

Highlighted Details

  • Evaluates 20 fine-grained ability dimensions across hierarchical levels.
  • Utilizes multiple-choice format and Circular Evaluation for robust, objective scoring.
  • LLM-based extractors map free-form VLM outputs to the given answer choices (see the sketch after this list).
  • Supports both English and Chinese language versions.
  • Includes CCBench for Chinese culture-related assessments.
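
As a rough sketch of the extraction step referenced above, a rule-based first pass might look like the following, with unresolved outputs handed to the LLM-based extractor (ChatGPT in the benchmark); extract_choice and its heuristics are illustrative assumptions, not the official extraction logic.

    # Illustrative rule-based first pass for choice extraction; anything it
    # cannot resolve would be deferred to the LLM-based extractor.
    import re

    def extract_choice(free_form_answer, options):
        """options: dict like {'A': 'a dog', 'B': 'a cat', ...}.
        Returns a matched letter, or None to defer to the LLM extractor."""
        text = free_form_answer.strip()
        # Case 1: the answer starts with a bare letter, e.g. "B", "B.", "(C) ...".
        match = re.match(r"^\(?([A-D])\)?[\s.,:)]", text + " ")
        if match:
            return match.group(1)
        # Case 2: the option text itself appears in the response.
        for letter, option_text in options.items():
            if option_text.lower() in text.lower():
                return letter
        return None  # ambiguous -> use the LLM-based extractor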

Maintenance & Community

Developed by the OpenCompass Community, MMBench benefits from active community involvement. Recent updates include integration with VLMEvalKit, CCBench enhancements, and Chinese-language support. A submission system facilitates leaderboard participation.

Licensing & Compatibility

The provided README does not specify the software license. Users should verify licensing terms for integration, particularly for commercial use.

Limitations & Caveats

The evaluation depends on LLM-based choice extractors for parsing VLM outputs, introducing a dependency on external LLM performance. The Circular Evaluation strategy is more demanding than traditional methods, potentially causing significant accuracy drops (10-20%) for existing VLMs.

Health Check

  • Last Commit: 5 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 10 stars in the last 30 days
