Multimodal LLM benchmark using multiple-choice questions
Top 81.4% on sourcepulse
SEED-Bench provides a comprehensive suite of benchmarks for evaluating Multimodal Large Language Models (MLLMs). It addresses the need for standardized, multi-dimensional assessment across visual and textual understanding tasks and is aimed at researchers and developers building and testing MLLMs, offering a robust framework for comparing model performance and identifying areas for improvement.
How It Works
SEED-Bench uses a multiple-choice question format with human-annotated answers, covering a wide array of evaluation dimensions. The benchmark is organized into several versions (SEED-Bench-1, SEED-Bench-2, SEED-Bench-2-Plus, and SEED-Bench-H), each expanding the number of questions, the dimensions covered, and specific task focuses such as text-rich visual comprehension or integrated multimodal capabilities. This structure enables a granular evaluation of MLLMs' abilities in areas such as spatial-temporal understanding, chart interpretation, and image generation.
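As a rough illustration of this multiple-choice setup, the sketch below scores each of the four candidate answers with a stand-in scoring callable and takes the highest-scoring one as the model's prediction. `score_fn` is hypothetical, and the field names (`question`, `choice_a`–`choice_d`, `answer`, `data_id`) are assumptions meant to mirror the annotation format, not the repository's actual eval.py.

```python
# Hedged sketch of a multiple-choice evaluation loop in the SEED-Bench style.
# `score_fn` is a hypothetical callable standing in for an MLLM that returns a
# log-likelihood for (visual input id, question, candidate answer); field names
# are assumed to mirror the benchmark's annotation format.
from typing import Callable, Dict, Iterable

CHOICE_KEYS = ("choice_a", "choice_b", "choice_c", "choice_d")

def predict_choice(sample: Dict, score_fn: Callable[[str, str, str], float]) -> str:
    """Return 'A'/'B'/'C'/'D' for whichever candidate answer scores highest."""
    scores = {
        key[-1].upper(): score_fn(sample["data_id"], sample["question"], sample[key])
        for key in CHOICE_KEYS
    }
    return max(scores, key=scores.get)

def accuracy(samples: Iterable[Dict], score_fn) -> float:
    """Fraction of samples whose top-scoring choice matches the annotated answer."""
    samples = list(samples)
    correct = sum(predict_choice(s, score_fn) == s["answer"] for s in samples)
    return correct / max(len(samples), 1)
```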
Quick Start & Requirements
Evaluation is run from the command line with a model name, the benchmark annotation file, and an output directory, for example:
python eval.py --model instruct_blip --anno_path SEED-Bench.json --output-dir results
Environment setup and dataset preparation are described in INSTALL.md and DATASET.md.
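As a quick sanity check before a full run, a short script along the following lines can summarize how many questions fall under each evaluation dimension. The top-level `questions` key and the `question_type_id` field are assumptions about the layout of SEED-Bench.json and should be verified against DATASET.md.

```python
# Hedged sketch: count questions per evaluation dimension in the annotation file.
# The top-level "questions" key and the "question_type_id" field are assumptions
# about SEED-Bench.json's layout; check DATASET.md for the authoritative format.
import json
from collections import Counter

with open("SEED-Bench.json", "r", encoding="utf-8") as f:
    anno = json.load(f)

# Tolerate either a top-level list or a {"questions": [...]} wrapper.
questions = anno["questions"] if isinstance(anno, dict) else anno
per_dimension = Counter(q.get("question_type_id") for q in questions)

for dim, count in sorted(per_dimension.items(), key=lambda kv: str(kv[0])):
    print(f"dimension {dim}: {count} questions")
```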
Highlighted Details
Maintenance & Community
The project has seen active development with releases of multiple benchmark versions and integration into community platforms like OpenCompass. The primary contributors are listed as Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan.
Licensing & Compatibility
SEED-Bench is released under the Apache License Version 2.0. The underlying datasets used for SEED-Bench have various licenses, primarily CC-BY, with some specific dataset licenses noted (e.g., Conceptual Captions, PlotQA, ScienceQA). Compatibility for commercial use is generally permissive under Apache 2.0, but users should verify the licenses of the individual datasets if redistributing or using them directly.
Limitations & Caveats
The project relies on data from various sources, each with its own license and potential copyright holders; users must ensure compliance. While the benchmark is extensive, specific performance claims are tied to the evaluation methodology and model implementations used for leaderboard submissions.