Multimodal LLM benchmark using multiple-choice questions
Top 81.4% on sourcepulse
SEED-Bench provides a comprehensive suite of benchmarks for evaluating Multimodal Large Language Models (MLLMs). It addresses the need for standardized, multi-dimensional assessment across visual and textual understanding tasks and is aimed at researchers and developers building and testing MLLMs, offering a robust framework for comparing model performance and identifying areas for improvement.
How It Works
SEED-Bench uses a multiple-choice question format with human-annotated answers, covering a wide array of evaluation dimensions. The benchmark is organized into several versions (SEED-Bench-1, SEED-Bench-2, SEED-Bench-2-Plus, and SEED-Bench-H), each expanding the number of questions, the dimensions covered, and specific task focuses such as text-rich visual comprehension or integrated multimodal capabilities. This structure enables a granular evaluation of MLLMs' abilities in areas such as spatial-temporal understanding, chart interpretation, and image generation.
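As a rough illustration of this multiple-choice setup, the sketch below scores each of the four candidate answers with a stand-in scoring callable and takes the highest-scoring one as the model's prediction. `score_fn` is hypothetical, and the field names (`question`, `choice_a`–`choice_d`, `answer`, `data_id`) are assumptions meant to mirror the annotation format, not the repository's actual eval.py.

```python
# Hedged sketch of a multiple-choice evaluation loop in the SEED-Bench style.
# `score_fn` is a hypothetical callable standing in for an MLLM that returns a
# log-likelihood for (visual input id, question, candidate answer); field names
# are assumed to mirror the benchmark's annotation format.
from typing import Callable, Dict, Iterable

CHOICE_KEYS = ("choice_a", "choice_b", "choice_c", "choice_d")

def predict_choice(sample: Dict, score_fn: Callable[[str, str, str], float]) -> str:
    """Return 'A'/'B'/'C'/'D' for whichever candidate answer scores highest."""
    scores = {
        key[-1].upper(): score_fn(sample["data_id"], sample["question"], sample[key])
        for key in CHOICE_KEYS
    }
    return max(scores, key=scores.get)

def accuracy(samples: Iterable[Dict], score_fn) -> float:
    """Fraction of samples whose top-scoring choice matches the annotated answer."""
    samples = list(samples)
    correct = sum(predict_choice(s, score_fn) == s["answer"] for s in samples)
    return correct / max(len(samples), 1)
```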
Quick Start & Requirements
Evaluation is run from the command line with a model name, the benchmark annotation file, and an output directory, for example:
python eval.py --model instruct_blip --anno_path SEED-Bench.json --output-dir results
Environment setup and dataset preparation are described in INSTALL.md and DATASET.md.
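As a quick sanity check before a full run, a short script along the following lines can summarize how many questions fall under each evaluation dimension. The top-level `questions` key and the `question_type_id` field are assumptions about the layout of SEED-Bench.json and should be verified against DATASET.md.

```python
# Hedged sketch: count questions per evaluation dimension in the annotation file.
# The top-level "questions" key and the "question_type_id" field are assumptions
# about SEED-Bench.json's layout; check DATASET.md for the authoritative format.
import json
from collections import Counter

with open("SEED-Bench.json", "r", encoding="utf-8") as f:
    anno = json.load(f)

# Tolerate either a top-level list or a {"questions": [...]} wrapper.
questions = anno["questions"] if isinstance(anno, dict) else anno
per_dimension = Counter(q.get("question_type_id") for q in questions)

for dim, count in sorted(per_dimension.items(), key=lambda kv: str(kv[0])):
    print(f"dimension {dim}: {count} questions")
```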
Highlighted Details
Maintenance & Community
The project has seen active development with releases of multiple benchmark versions and integration into community platforms like OpenCompass. The primary contributors are listed as Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan.
Licensing & Compatibility
SEED-Bench is released under the Apache License Version 2.0. The underlying datasets used for SEED-Bench have various licenses, primarily CC-BY, with some specific dataset licenses noted (e.g., Conceptual Captions, PlotQA, ScienceQA). Compatibility for commercial use is generally permissive under Apache 2.0, but users should verify the licenses of the individual datasets if redistributing or using them directly.
Limitations & Caveats
The project relies on data from various sources, each with its own license and potential copyright holders; users must ensure compliance. While the benchmark is extensive, specific performance claims are tied to the evaluation methodology and model implementations used for leaderboard submissions.