VLMEvalKit by open-compass

Evaluation toolkit for large multi-modality models (LMMs)

Created 1 year ago · 2,820 stars · Top 17.2% on sourcepulse

Project Summary

VLMEvalKit is an open-source toolkit for the comprehensive evaluation of Large Vision-Language Models (LVLMs). It lets researchers and developers assess LVLMs across a wide array of benchmarks and models, aiming to standardize evaluation and make results reproducible with minimal data-preparation effort.

How It Works

The toolkit employs a generation-based evaluation approach for all LVLMs, supporting both exact matching and LLM-based answer extraction for scoring. This method allows for a unified evaluation framework across diverse benchmarks, abstracting away the complexities of individual benchmark data handling and inference pipelines.
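
To make the two scoring paths concrete, here is a minimal, self-contained sketch (not VLMEvalKit's actual API): exact matching is tried first, with a simple regex standing in for the LLM-based answer-extraction step.

```python
import re

def exact_match(prediction: str, answer: str) -> bool:
    """Exact matching: the stripped, uppercased prediction equals the gold answer."""
    return prediction.strip().upper() == answer.strip().upper()

def extract_choice(prediction: str) -> str:
    """Fallback extraction for multiple-choice output. VLMEvalKit can delegate
    this step to an LLM judge; a standalone-letter regex stands in for it here."""
    match = re.search(r"\b([A-D])\b", prediction.strip().upper())
    return match.group(1) if match else ""

def score(prediction: str, answer: str) -> bool:
    """Score one prediction: try exact matching, then fall back to extraction."""
    if exact_match(prediction, answer):
        return True
    return extract_choice(prediction) == answer.strip().upper()

print(score("B", "B"))                   # → True (exact match)
print(score("The answer is (C).", "C"))  # → True (via extraction)
```

Either path reduces a free-form generation to the same boolean score, which is what allows one framework to cover many benchmarks.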

Quick Start & Requirements

  • Install: pip install vlmeval
  • Prerequisites: Specific transformers versions are recommended for different models (e.g., transformers==4.37.0 for LLaVA series, transformers==4.45.0 for Aria). torchvision>=0.16 is recommended for Moondream and Aria. flash-attn installation is recommended for Aria.
  • Demo: the project README provides Python snippets for single- and multi-image generation.
  • Documentation: a Quick Start guide is available in the project docs.
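
For reference, the README's demo follows roughly this pattern (a sketch only: the model name and image paths are placeholders, and the exact API may differ between versions; running it requires the package and model weights):

```python
from vlmeval.config import supported_VLM

# Instantiate a supported model by its registry name (the name is illustrative).
model = supported_VLM['qwen_chat']()

# Single-image generation: a list mixing an image path and a text prompt.
ret = model.generate(['assets/apple.jpg', 'What is in this image?'])
print(ret)

# Multi-image generation: several image paths followed by the question.
ret = model.generate(['assets/apple_1.jpg', 'assets/apple_2.jpg',
                      'How many apples are in these images?'])
print(ret)
```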

Highlighted Details

  • Supports over 220 LVLMs and 80+ benchmarks, including recent additions like MMMU-Pro, WeMath, and NaturalBench.
  • Provides official leaderboards and detailed evaluation results for community reference.
  • Offers a flexible configuration system for custom evaluation settings.
  • Facilitates community contributions, with opportunities for contributors to be listed on arXiv reports.

Maintenance & Community

  • Active development with frequent updates on supported models and benchmarks.
  • Community engagement encouraged via a Discord channel.
  • Aims to acknowledge and credit community contributions in reports.

Licensing & Compatibility

  • No license is stated in the README text provided here; clarify licensing before commercial use or closed-source integration.

Limitations & Caveats

  • Because evaluation is generation-based, accuracy numbers may not exactly match those reported in papers that used a different methodology (e.g., PPL-based scoring).
  • Default prompt templates are used; implementing a VLM's own prompt template may be needed for best reproducibility.
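
The first caveat can be illustrated with a toy comparison of the two methodologies (the log-probability values and parsing logic are invented for illustration):

```python
# Toy per-option log-likelihoods that a PPL-based evaluator might compute
# (values are made up for illustration).
option_logprobs = {"A": -5.2, "B": -3.1, "C": -4.8, "D": -6.0}

def ppl_based_choice(logprobs):
    """PPL-based scoring: pick the option whose text the model finds most likely."""
    return max(logprobs, key=logprobs.get)

def generation_based_choice(generated_text):
    """Generation-based scoring: parse the chosen option out of generated text."""
    for token in generated_text.replace(".", " ").split():
        if token.upper() in {"A", "B", "C", "D"}:
            return token.upper()
    return None

print(ppl_based_choice(option_logprobs))            # → B
print(generation_based_choice("The answer is C."))  # → C
```

Since the two paths can disagree on the same model output, identical benchmarks scored by different methodologies can yield different accuracy numbers.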

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 41
  • Issues (30d): 27

Star History

  • 549 stars in the last 90 days
