Evaluation framework for large models
Top 25.1% on SourcePulse
EvalScope is a comprehensive framework for evaluating and benchmarking diverse large models, including LLMs and multimodal models. It supports various assessment scenarios like RAG, arena mode, and inference performance testing, offering built-in benchmarks and metrics. The framework is designed for researchers and developers needing a streamlined, customizable solution for model evaluation, seamlessly integrating with training frameworks like ms-swift.
How It Works
EvalScope employs a modular architecture with Model Adapters for input standardization, Data Adapters for data processing, and multiple Evaluation Backends. It supports its native backend, OpenCompass, VLMEvalKit for multimodal tasks, and RAGEval for RAG scenarios, alongside third-party integrations like ToolBench. A dedicated Performance Evaluator module measures inference service performance, with results compiled into comprehensive reports and visualizations.
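As a rough illustration of the backend routing described above, the sketch below shows how a task configuration might select a non-native backend. The `eval_backend` and `eval_config` fields and their nested keys are assumptions for illustration, not a definitive schema; consult the project documentation for the exact format each backend expects.

```python
from evalscope.run import run_task

# Hedged sketch: route an evaluation through a non-native backend.
# Field names (eval_backend, eval_config) and the nested keys are
# assumptions for illustration only.
task_cfg = {
    "eval_backend": "OpenCompass",   # alternatives might include "VLMEvalKit" or "RAGEval"
    "eval_config": {
        "datasets": ["mmlu"],        # backend-specific dataset identifiers
        "models": [
            {"path": "Qwen/Qwen2.5-0.5B-Instruct", "batch_size": 8},  # placeholder model
        ],
    },
}

run_task(task_cfg=task_cfg)
```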
Quick Start & Requirements
Install with `pip install evalscope` (or `pip install 'evalscope[all]'` for all backends); install from source via `git clone` and `pip install -e .`. Optional extras can be installed individually (`opencompass`, `vlmeval`, `rag`, `perf`, `app`). Run an evaluation from the CLI with `evalscope eval --model <model_id> --datasets <dataset_names> --limit <num>`. A Python API is available via `run_task`.
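A minimal sketch of the Python API, assuming `run_task` accepts a dict-style task config that mirrors the CLI flags above; the model ID and dataset names are placeholders.

```python
from evalscope.run import run_task

# Mirror the CLI flags above in a dict-style task config.
# Model ID and dataset names are illustrative placeholders.
task_cfg = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "datasets": ["gsm8k", "arc"],
    "limit": 10,  # evaluate only the first N samples per dataset
}

run_task(task_cfg=task_cfg)
```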
Highlighted Details
Integrates with `wandb` and `swanlab` for experiment tracking and result visualization.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project was renamed from `llmuses` to `evalscope`, requiring users of older versions to update their imports. The exact license is not clearly stated, which could be a consideration for commercial use.
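For users migrating from older releases, the change is roughly an import rename; the module paths below are assumptions based on the current package name and may differ by version.

```python
# Older releases shipped under the llmuses package name, e.g.:
#   from llmuses.run import run_task
# After the rename, the equivalent entry point lives under evalscope:
from evalscope.run import run_task  # path is an assumption; check your installed version
```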