VLMEvalKit by open-compass

Evaluation toolkit for large multi-modality models (LMMs)

Created 1 year ago · 2,820 stars · Top 17.2% on sourcepulse

Project Summary

VLMEvalKit is an open-source toolkit for the comprehensive evaluation of Large Vision-Language Models (LVLMs). It lets researchers and developers assess LVLMs across a wide array of benchmarks and models, aiming to standardize evaluation and make results reproducible with minimal data-preparation effort.

How It Works

The toolkit employs a generation-based evaluation approach for all LVLMs, supporting both exact matching and LLM-based answer extraction for scoring. This method allows for a unified evaluation framework across diverse benchmarks, abstracting away the complexities of individual benchmark data handling and inference pipelines.
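
To make the two scoring paths concrete, here is a minimal, self-contained sketch (not VLMEvalKit's actual API): exact matching is tried first, with a simple regex standing in for the LLM-based answer-extraction step.

```python
import re

def exact_match(prediction: str, answer: str) -> bool:
    """Exact matching: the stripped, uppercased prediction equals the gold answer."""
    return prediction.strip().upper() == answer.strip().upper()

def extract_choice(prediction: str) -> str:
    """Fallback extraction for multiple-choice output. VLMEvalKit can delegate
    this step to an LLM judge; a standalone-letter regex stands in for it here."""
    match = re.search(r"\b([A-D])\b", prediction.strip().upper())
    return match.group(1) if match else ""

def score(prediction: str, answer: str) -> bool:
    """Score one prediction: try exact matching, then fall back to extraction."""
    if exact_match(prediction, answer):
        return True
    return extract_choice(prediction) == answer.strip().upper()

print(score("B", "B"))                   # → True (exact match)
print(score("The answer is (C).", "C"))  # → True (via extraction)
```

Either path reduces a free-form generation to the same boolean score, which is what allows one framework to cover many benchmarks.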

Quick Start & Requirements

  • Install: pip install vlmeval
  • Prerequisites: Specific transformers versions are recommended for different models (e.g., transformers==4.37.0 for LLaVA series, transformers==4.45.0 for Aria). torchvision>=0.16 is recommended for Moondream and Aria. flash-attn installation is recommended for Aria.
  • Demo: the project README provides Python snippets for single- and multi-image generation.
  • Documentation: a Quick Start guide is available in the project docs.
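
For reference, the README's demo follows roughly this pattern (a sketch only: the model name and image paths are placeholders, and the exact API may differ between versions; running it requires the package and model weights):

```python
from vlmeval.config import supported_VLM

# Instantiate a supported model by its registry name (the name is illustrative).
model = supported_VLM['qwen_chat']()

# Single-image generation: a list mixing an image path and a text prompt.
ret = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(ret)

# Multi-image generation: several image paths followed by the question.
ret = model.generate(['assets/apple_1.jpg', 'assets/apple_2.jpg',
                      'How many apples are in these images?'])
print(ret)
```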

Highlighted Details

  • Supports over 220 LVLMs and 80+ benchmarks, including recent additions like MMMU-Pro, WeMath, and NaturalBench.
  • Provides official leaderboards and detailed evaluation results for community reference.
  • Offers a flexible configuration system for custom evaluation settings.
  • Facilitates community contributions, with opportunities for contributors to be listed on arXiv reports.

Maintenance & Community

  • Active development with frequent updates on supported models and benchmarks.
  • Community engagement encouraged via a Discord channel.
  • Aims to acknowledge and credit community contributions in reports.

Licensing & Compatibility

  • No license is stated in the README text provided here; clarify licensing before commercial use or closed-source integration.

Limitations & Caveats

  • Because evaluation is generation-based, accuracy numbers may not exactly match those reported in papers that used a different methodology (e.g., PPL-based scoring).
  • Default prompt templates are used; implementing a VLM's own prompt template may be needed for best reproducibility.
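
The first caveat can be illustrated with a toy comparison of the two methodologies (the log-probability values and parsing logic are invented for illustration):

```python
# Toy per-option log-likelihoods that a PPL-based evaluator might compute
# (values are made up for illustration).
option_logprobs = {"A": -5.2, "B": -3.1, "C": -4.8, "D": -6.0}

def ppl_based_choice(logprobs):
    """PPL-based scoring: pick the option whose text the model finds most likely."""
    return max(logprobs, key=logprobs.get)

def generation_based_choice(generated_text):
    """Generation-based scoring: parse the chosen option out of generated text."""
    for token in generated_text.replace(".", " ").split():
        if token.upper() in {"A", "B", "C", "D"}:
            return token.upper()
    return None

print(ppl_based_choice(option_logprobs))            # → B
print(generation_based_choice("The answer is C."))  # → C
```

Since the two paths can disagree on the same model output, identical benchmarks scored by different methodologies can yield different accuracy numbers.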

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 41
  • Issues (30d): 27

Star History

  • 549 stars in the last 90 days
