Evaluation platform for large multi-modality models
This project provides an evaluation platform for large vision-language models (LVLMs), enabling side-by-side benchmarking with image inputs, similar to Chatbot Arena. It targets researchers and developers working with LVLMs, offering a structured way to compare model performance on visual question-answering tasks.
How It Works
The platform operates using a distributed architecture comprising a controller, model workers, and a Gradio web server. Model workers host individual LVLMs and register with the controller, which then coordinates them to serve requests from the web UI. This setup allows for flexible deployment and comparison of multiple models simultaneously.
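The project's actual worker/controller interface is not detailed here, but the pattern amounts to each worker announcing itself to a central controller over HTTP. Below is a minimal sketch of that idea; the controller address, route name, and payload fields are assumptions for illustration, not the project's real API.

```python
# Hypothetical sketch of the worker-registration pattern; the project's own
# entry points are controller.py, model_worker.py, and server_demo.py.
import requests

CONTROLLER_URL = "http://localhost:21001"   # assumed controller address
WORKER_URL = "http://localhost:40000"       # assumed address of this worker

def register_worker(model_name: str) -> None:
    """Tell the controller this worker exists so the web UI can route requests to it."""
    payload = {"worker_url": WORKER_URL, "model_name": model_name}
    # The route name is an assumption; the real controller defines its own API.
    requests.post(f"{CONTROLLER_URL}/register_worker", json=payload, timeout=10)

if __name__ == "__main__":
    register_worker("SELECTED_MODEL")
```

In this pattern, the Gradio web server asks the controller which workers are available and forwards each image-question pair to two of them, so their answers can be compared side by side.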
Quick Start & Requirements
Create a dedicated environment (conda create -n arena python=3.10) and install the core dependencies (pip install numpy gradio uvicorn fastapi). Each model may require specific package versions, necessitating separate environments. To launch the platform, run controller.py first, then model_worker.py --model-name SELECTED_MODEL --device TARGET_DEVICE for each model, and finally server_demo.py.
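For example, a session comparing two models side by side might be launched as follows (model names and device identifiers are placeholders, and invoking the scripts directly with python is an assumption):
python controller.py
python model_worker.py --model-name MODEL_A --device cuda:0
python model_worker.py --model-name MODEL_B --device cuda:1
python server_demo.py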
Highlighted Details
Maintenance & Community
The project is actively updated, with recent additions including the OmniMedVQA benchmark and ability-level dataset splits for LVLM-eHub. Contributions to evaluation and arena integration are welcome. Contact is available via email (xupeng@pjlab.org.cn) and WeChat.
Licensing & Compatibility
The project is released under a non-commercial license.
Limitations & Caveats
This is an experimental research tool intended for non-commercial use only. It has limited safeguards and may produce inappropriate content. It must not be used for illegal, harmful, violent, racist, or sexual purposes.