Multi-Modality-Arena by OpenGVLab

Evaluation platform for large multi-modality models

created 2 years ago
531 stars

Top 60.4% on sourcepulse

Project Summary

This project provides an evaluation platform for large vision-language models (LVLMs), enabling side-by-side benchmarking with image inputs, similar to Chatbot Arena. It targets researchers and developers working with LVLMs, offering a structured way to compare model performance on visual question-answering tasks.

How It Works

The platform operates using a distributed architecture comprising a controller, model workers, and a Gradio web server. Model workers host individual LVLMs and register with the controller, which then coordinates them to serve requests from the web UI. This setup allows for flexible deployment and comparison of multiple models simultaneously.
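
As a rough illustration of that pattern, the sketch below shows a minimal controller written with FastAPI and uvicorn (both listed in the install requirements). The endpoint names, payload fields, and port are hypothetical and do not reflect the project's actual API.

    # Hypothetical sketch of the controller's role: model workers register
    # themselves, and the web UI asks the controller which worker serves a
    # given model. Endpoint names, fields, and the port are illustrative only.
    from fastapi import FastAPI
    import uvicorn

    app = FastAPI()
    workers = {}  # model name -> worker address

    @app.post("/register_worker")
    def register_worker(model_name: str, worker_addr: str):
        # Called by a model worker on startup to announce itself.
        workers[model_name] = worker_addr
        return {"status": "ok"}

    @app.get("/get_worker_addr")
    def get_worker_addr(model_name: str):
        # Called by the Gradio server to route a request to the right worker.
        return {"worker_addr": workers.get(model_name, "")}

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=21001)  # port is an assumption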

Quick Start & Requirements

  • Install: pip install numpy gradio uvicorn fastapi
  • Prerequisites: Python 3.10, Conda environment (conda create -n arena python=3.10). Each model may require specific package versions, necessitating separate environments.
  • Launch: run controller.py first, then one model_worker.py --model-name SELECTED_MODEL --device TARGET_DEVICE process per model, and finally server_demo.py (a launcher sketch follows this list).
  • Links: LVLM-eHub, Tiny LVLM Evaluation, OmniMedVQA
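
The manual launch steps can also be wrapped in a small convenience script. The sketch below mirrors the quick-start commands and assumes all selected models share one environment (models with conflicting dependencies still need separate environments, as noted above); the model names, device string, and wait times are placeholders.

    # Hypothetical launcher mirroring the quick-start steps: controller first,
    # then one worker per model, then the Gradio demo server.
    import subprocess
    import sys
    import time

    MODELS = ["MODEL_A", "MODEL_B"]  # placeholder model names
    DEVICE = "cuda:0"                # placeholder target device

    procs = [subprocess.Popen([sys.executable, "controller.py"])]
    time.sleep(5)  # give the controller time to start before workers register
    for name in MODELS:
        procs.append(subprocess.Popen(
            [sys.executable, "model_worker.py",
             "--model-name", name, "--device", DEVICE]))
    time.sleep(10)  # give workers time to load models and register
    procs.append(subprocess.Popen([sys.executable, "server_demo.py"]))

    try:
        procs[-1].wait()  # block until the demo server exits
    finally:
        for p in procs:
            p.terminate()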

Highlighted Details

  • Supports 12+ LVLMs including MiniGPT-4, LLaVA, BLIP-2, and Google Bard.
  • Features the OmniMedVQA benchmark for medical LVLMs, with 118,010 images and 127,995 QA items.
  • Includes a leaderboard categorizing datasets by abilities like visual reasoning and object hallucination.
  • Employs a ChatGPT Ensemble Evaluation for improved agreement with human judgment.

Maintenance & Community

The project is actively updated, with recent additions including the OmniMedVQA benchmark and ability-level dataset splits for LVLM-eHub. Contributions to evaluation and arena integration are welcomed. Contact is available via email (xupeng@pjlab.org.cn) and WeChat.

Licensing & Compatibility

The project is released under a non-commercial license.

Limitations & Caveats

This is an experimental research tool intended for non-commercial use only. It has limited safeguards and may produce inappropriate content. Use for illegal, harmful, violent, racist, or sexual purposes is prohibited.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1+ week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

16 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Luca Antiga (CTO of Lightning AI), and 4 more.

helm by stanford-crfm

Open-source Python framework for holistic evaluation of foundation models

created 3 years ago
updated 1 day ago
2k stars

Top 0.9% on sourcepulse