Multi-Modality-Arena by OpenGVLab

Evaluation platform for large multi-modality models

created 2 years ago
531 stars

Top 60.4% on sourcepulse

Project Summary

This project provides an evaluation platform for large vision-language models (LVLMs), enabling side-by-side benchmarking with image inputs, similar to Chatbot Arena. It targets researchers and developers working with LVLMs, offering a structured way to compare model performance on visual question-answering tasks.

How It Works

The platform operates using a distributed architecture comprising a controller, model workers, and a Gradio web server. Model workers host individual LVLMs and register with the controller, which then coordinates them to serve requests from the web UI. This setup allows for flexible deployment and comparison of multiple models simultaneously.
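
As a rough illustration of that pattern, the sketch below shows a minimal controller written with FastAPI and uvicorn (both listed in the install requirements). The endpoint names, payload fields, and port are hypothetical and do not reflect the project's actual API.

    # Hypothetical sketch of the controller's role: model workers register
    # themselves, and the web UI asks the controller which worker serves a
    # given model. Endpoint names, fields, and the port are illustrative only.
    from fastapi import FastAPI
    import uvicorn

    app = FastAPI()
    workers = {}  # model name -> worker address

    @app.post("/register_worker")
    def register_worker(model_name: str, worker_addr: str):
        # Called by a model worker on startup to announce itself.
        workers[model_name] = worker_addr
        return {"status": "ok"}

    @app.get("/get_worker_addr")
    def get_worker_addr(model_name: str):
        # Called by the Gradio server to route a request to the right worker.
        return {"worker_addr": workers.get(model_name, "")}

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=21001)  # port is an assumption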

Quick Start & Requirements

  • Install: pip install numpy gradio uvicorn fastapi
  • Prerequisites: Python 3.10, Conda environment (conda create -n arena python=3.10). Each model may require specific package versions, necessitating separate environments.
  • Launch: run controller.py first, then one model_worker.py --model-name SELECTED_MODEL --device TARGET_DEVICE process per model, and finally server_demo.py (a launcher sketch follows this list).
  • Links: LVLM-eHub, Tiny LVLM Evaluation, OmniMedVQA
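
The manual launch steps can also be wrapped in a small convenience script. The sketch below mirrors the quick-start commands and assumes all selected models share one environment (models with conflicting dependencies still need separate environments, as noted above); the model names, device string, and wait times are placeholders.

    # Hypothetical launcher mirroring the quick-start steps: controller first,
    # then one worker per model, then the Gradio demo server.
    import subprocess
    import sys
    import time

    MODELS = ["MODEL_A", "MODEL_B"]  # placeholder model names
    DEVICE = "cuda:0"                # placeholder target device

    procs = [subprocess.Popen([sys.executable, "controller.py"])]
    time.sleep(5)  # give the controller time to start before workers register
    for name in MODELS:
        procs.append(subprocess.Popen(
            [sys.executable, "model_worker.py",
             "--model-name", name, "--device", DEVICE]))
    time.sleep(10)  # give workers time to load models and register
    procs.append(subprocess.Popen([sys.executable, "server_demo.py"]))

    try:
        procs[-1].wait()  # block until the demo server exits
    finally:
        for p in procs:
            p.terminate()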

Highlighted Details

  • Supports 12+ LVLMs including MiniGPT-4, LLaVA, BLIP-2, and Google Bard.
  • Features the OmniMedVQA benchmark for medical LVLMs, with 118,010 images and 127,995 QA items.
  • Includes a leaderboard categorizing datasets by abilities like visual reasoning and object hallucination.
  • Employs a ChatGPT Ensemble Evaluation for improved agreement with human judgment.

Maintenance & Community

The project is actively updated, with recent additions including the OmniMedVQA benchmark and ability-level dataset splits for LVLM-eHub. Contributions to evaluation and arena integration are welcomed. Contact is available via email (xupeng@pjlab.org.cn) and WeChat.

Licensing & Compatibility

The project is released under a non-commercial license.

Limitations & Caveats

This is an experimental research tool intended for non-commercial use only. It has limited safeguards and may produce inappropriate content. Use for illegal, harmful, violent, racist, or sexual purposes is prohibited.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

16 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Luca Antiga (CTO of Lightning AI), and 4 more.

helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

created 3 years ago
updated 1 day ago
2k stars

Top 0.9% on sourcepulse