evalscope by modelscope

Evaluation framework for large models

Created 2 years ago
2,237 stars

Top 20.0% on SourcePulse

View on GitHub
Project Summary

EvalScope is a comprehensive framework for evaluating and benchmarking diverse large models, including LLMs and multimodal models. It supports various assessment scenarios like RAG, arena mode, and inference performance testing, offering built-in benchmarks and metrics. The framework is designed for researchers and developers needing a streamlined, customizable solution for model evaluation, seamlessly integrating with training frameworks like ms-swift.

How It Works

EvalScope employs a modular architecture with Model Adapters for input standardization, Data Adapters for data processing, and multiple Evaluation Backends. It supports its native backend, OpenCompass, VLMEvalKit for multimodal tasks, and RAGEval for RAG scenarios, alongside third-party integrations like ToolBench. A dedicated Performance Evaluator module measures inference service performance, with results compiled into comprehensive reports and visualizations.
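
For illustration, here is a minimal sketch of how a task could be routed to a specific backend through the Python API. The eval_backend and eval_config keys follow the pattern described in the documentation, but the exact key names, accepted backend identifiers, and model-config fields are assumptions that may differ between versions:

    from evalscope.run import run_task

    # Route the evaluation through the OpenCompass backend instead of the
    # native one; backend-specific settings go under 'eval_config'.
    task_cfg = {
        'eval_backend': 'OpenCompass',          # e.g. 'Native', 'VLMEvalKit', 'RAGEval'
        'eval_config': {
            'datasets': ['gsm8k'],              # benchmark exposed by the chosen backend
            'models': [
                {'path': 'Qwen/Qwen2.5-0.5B-Instruct'},  # illustrative model id
            ],
        },
    }

    run_task(task_cfg=task_cfg)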

Quick Start & Requirements

  • Installation: pip install evalscope (or pip install 'evalscope[all]' for all backends). To install from source, clone the repository and run pip install -e . in the project root.
  • Prerequisites: Python 3.10 is recommended. Optional extras enable specific backends and features (opencompass, vlmeval, rag, perf, app).
  • Quick Start: Run evalscope eval --model <model_id> --datasets <dataset_names> --limit <num> from the command line; a Python API is also available via run_task (see the sketch after this list).
  • Documentation: English documentation is available.
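
As referenced above, a minimal sketch of the equivalent Python-API call. It mirrors the documented quick-start pattern; the model id and dataset names are illustrative placeholders, and key names may vary slightly between releases:

    from evalscope.run import run_task

    # Evaluate a model on two built-in benchmarks, capping each at 5 samples
    # for a quick smoke test (the same effect as --limit on the CLI).
    task_cfg = {
        'model': 'Qwen/Qwen2.5-0.5B-Instruct',  # illustrative model id
        'datasets': ['gsm8k', 'arc'],           # built-in benchmark names
        'limit': 5,
    }

    run_task(task_cfg=task_cfg)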

Highlighted Details

  • Supports a wide array of evaluation backends including native, OpenCompass, VLMEvalKit, and RAGEval.
  • Includes a dedicated module for model serving performance evaluation and stress testing.
  • Offers visualization tools for evaluation results and supports integration with wandb and swanlab.
  • Features an "Arena Mode" for pairwise model comparison and evaluation.

Maintenance & Community

  • Actively maintained by ModelScope.
  • Community support via Discord.
  • Roadmap available, indicating ongoing development for distributed evaluation and new benchmarks.

Licensing & Compatibility

  • The README does not explicitly state a license. The project's association with ModelScope suggests a permissive open-source license, and commercial use is likely compatible, but the exact license should be verified before adoption.

Limitations & Caveats

  • The project was renamed from llmuses to evalscope; users of older versions must update their package imports accordingly. As noted above, the absence of a clearly stated license may be a consideration for commercial adoption.
Health Check

  • Last Commit: 3 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 24
  • Issues (30d): 54
  • Star History: 146 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

evaluate by huggingface
ML model evaluation library for standardized performance reporting
2k stars · Created 3 years ago · Updated 1 month ago

Starred by Jeff Hammerbacher (Cofounder of Cloudera), Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), and 17 more.

simple-evals by openai
Lightweight library for evaluating language models
4k stars · Created 1 year ago · Updated 5 months ago