open-compass/MMBench: Benchmark for evaluating multi-modal LVLM capabilities
Top 97.0% on SourcePulse
MMBench provides a robust, fine-grained evaluation framework for large vision-language models (LVLMs), addressing limitations of traditional benchmarks. It targets researchers and developers needing to comprehensively assess LVLM capabilities across diverse skills, offering objective and scalable insights into model performance.
How It Works
MMBench features approximately 3,000 multiple-choice questions across 20 fine-grained ability dimensions, organized in a three-level hierarchy (L-1 to L-3) covering perception and reasoning skills. Evaluation uses a Circular Evaluation (CircularEval) strategy: inference is run multiple times per question with the answer choices rotated, and a question counts as correct only if the model answers correctly in every pass, which makes the score more robust. LLM-based choice extractors, primarily ChatGPT, parse free-form VLM outputs into specific multiple-choice answers (A, B, C, D) to keep scoring objective.
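The Python sketch below illustrates the CircularEval idea as described above (rotate the options, require every pass to be correct). The `model.answer` interface and data layout are hypothetical stand-ins, not the official VLMEvalKit implementation.

```python
# Minimal sketch of CircularEval scoring (illustrative only).
from collections import deque

def rotate_choices(choices, shift):
    """Rotate the option texts while the labels A/B/C/D stay fixed."""
    d = deque(choices)
    d.rotate(shift)
    return list(d)

def circular_eval(model, question, choices, answer_idx):
    """Count a question as correct only if the model picks the right
    option in every rotated pass (one pass per option)."""
    n = len(choices)
    for shift in range(n):
        rotated = rotate_choices(choices, shift)
        # The ground-truth option moves with the rotation
        # (assumes option texts are unique within a question).
        correct_label = "ABCD"[rotated.index(choices[answer_idx])]
        prediction = model.answer(question, rotated)  # hypothetical model API
        if prediction != correct_label:
            return False
    return True
```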
Quick Start & Requirements
Evaluation is performed with the official VLMEvalKit toolkit. After installing VLMEvalKit, Python scripts can be used to load and visualize the benchmark data. Model inference is launched with python run.py --model <your_model_name> --data MMBench_TEST_EN --mode infer, and the resulting Excel file can then be submitted to the MMBench leaderboard. Specific hardware requirements are not documented, but VLM evaluation typically requires GPU resources.
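As a minimal sketch of the data-loading step, assuming the benchmark is distributed as a TSV file with base64-encoded images alongside question and option columns (the file name and column names below are illustrative, not confirmed by this summary):

```python
# Sketch: load a MMBench-style TSV and inspect one sample.
# File name and column names ('image', 'question', 'A'-'D') are assumptions.
import base64
import io

import pandas as pd
from PIL import Image

def decode_base64_to_image(b64_string: str) -> Image.Image:
    """Decode a base64-encoded image cell into a PIL image."""
    return Image.open(io.BytesIO(base64.b64decode(b64_string)))

df = pd.read_csv("MMBench_TEST_EN.tsv", sep="\t")  # hypothetical file name
sample = df.iloc[0]
print(sample["question"], [sample[c] for c in "ABCD"])
decode_base64_to_image(sample["image"]).show()
```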
Highlighted Details
Maintenance & Community
Developed by the OpenCompass Community, MMBench benefits from active community involvement. Recent updates include the VLMEvalKit evaluation toolkit, CCBench enhancements, and Chinese-language support. A submission system facilitates leaderboard participation.
Licensing & Compatibility
The provided README does not specify the software license. Users should verify licensing terms for integration, particularly for commercial use.
Limitations & Caveats
The evaluation depends on LLM-based choice extractors for parsing VLM outputs, introducing a dependency on external LLM performance. The Circular Evaluation strategy is more demanding than traditional methods, potentially causing significant accuracy drops (10-20%) for existing VLMs.
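A rough back-of-the-envelope calculation shows why, assuming (purely for illustration) that the rotated passes are independent: a model that answers each individual pass correctly 95% of the time clears all four rotations of a 4-option question only about 81% of the time.

```python
# Illustrative arithmetic only; independence between passes is an assumption.
per_pass_accuracy = 0.95
passes = 4  # one rotated pass per answer option under CircularEval
print(per_pass_accuracy ** passes)  # ~0.815, i.e. a drop of roughly 13 points
```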