evalscope by modelscope

Evaluation framework for large models

Created 1 year ago
1,690 stars

Top 25.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

EvalScope is a comprehensive framework for evaluating and benchmarking diverse large models, including LLMs and multimodal models. It supports various assessment scenarios like RAG, arena mode, and inference performance testing, offering built-in benchmarks and metrics. The framework is designed for researchers and developers needing a streamlined, customizable solution for model evaluation, seamlessly integrating with training frameworks like ms-swift.

How It Works

EvalScope employs a modular architecture with Model Adapters for input standardization, Data Adapters for data processing, and multiple Evaluation Backends. It supports its native backend, OpenCompass, VLMEvalKit for multimodal tasks, and RAGEval for RAG scenarios, alongside third-party integrations like ToolBench. A dedicated Performance Evaluator module measures inference service performance, with results compiled into comprehensive reports and visualizations.
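
As a concrete illustration of the backend routing described above, here is a minimal, hedged sketch using the run_task/TaskConfig API referenced in the Quick Start section. The eval_backend/eval_config field names follow the documented TaskConfig pattern, but the backend-specific keys shown are assumptions and may differ from the current release:

```python
# Hedged sketch: routing a task to a non-native evaluation backend.
# `eval_backend` and `eval_config` follow the documented TaskConfig pattern,
# but the backend-specific keys below are illustrative assumptions; check the
# EvalScope docs for the exact schema expected by each backend.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    eval_backend="OpenCompass",        # or "Native", "VLMEvalKit", "RAGEval"
    eval_config={                      # forwarded to the selected backend
        "datasets": ["mmlu"],          # benchmark name as the backend expects it (assumed)
        "models": [                    # model entries in the backend's format (assumed)
            {"path": "Qwen/Qwen2.5-0.5B-Instruct"},
        ],
        "limit": 10,                   # small sample cap for a smoke test
    },
)

run_task(task_cfg=task_cfg)            # results are compiled into reports and visualizations
```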

Quick Start & Requirements

  • Installation: pip install evalscope (or pip install 'evalscope[all]' for all backends). To install from source, git clone the repository and run pip install -e .
  • Prerequisites: Python 3.10 is recommended. Optional dependency groups enable specific backends (opencompass, vlmeval, rag, perf, app).
  • Quick Start: Run evalscope eval --model <model_id> --datasets <dataset_names> --limit <num>, or use the Python API via run_task (see the sketch after this list).
  • Documentation: English Documents
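
A minimal quick-start sketch tying the CLI invocation and the run_task API together; the model and dataset names are placeholders, and the dict keys should be checked against the documentation:

```python
# Hedged quick-start sketch mirroring the CLI shown above:
#   evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 10
# Model and dataset names are placeholders; the dict-style task config follows
# the documented run_task usage.
from evalscope import run_task

task_cfg = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",  # model id (placeholder)
    "datasets": ["gsm8k"],                  # one or more built-in benchmarks
    "limit": 10,                            # evaluate only the first N samples
}

run_task(task_cfg=task_cfg)
```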

Highlighted Details

  • Supports a wide array of evaluation backends including native, OpenCompass, VLMEvalKit, and RAGEval.
  • Includes a dedicated module for model serving performance evaluation and stress testing (see the sketch after this list).
  • Offers visualization tools for evaluation results and supports integration with wandb and swanlab.
  • Features an "Arena Mode" for pairwise model comparison and evaluation.

Maintenance & Community

  • Actively maintained by ModelScope.
  • Community support via Discord.
  • Roadmap available, indicating ongoing development for distributed evaluation and new benchmarks.

Licensing & Compatibility

  • The specific license is not explicitly stated in the README, but the project's association with ModelScope suggests a permissive open-source license. Commercial use is likely permitted, but verify the exact license first.

Limitations & Caveats

  • The project was recently renamed from llmuses to evalscope; users of older versions need to update their imports.
  • The exact license is not clearly stated, which may be a consideration for commercial use.

Health Check

  • Last Commit: 22 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 29
  • Issues (30d): 59

Star History

182 stars in the last 30 days

Explore Similar Projects

RULER by NVIDIA
Evaluation suite for long-context language models (research paper)
1k stars · 0.8% · Created 1 year ago · Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

evaluate by huggingface
ML model evaluation library for standardized performance reporting
2k stars · 0.1% · Created 3 years ago · Updated 1 month ago
Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

simple-evals by openai
Lightweight library for evaluating language models
4k stars · 0.3% · Created 1 year ago · Updated 1 month ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Simon Willison (Coauthor of Django), and 16 more.