evalscope by modelscope

Evaluation framework for large models

Created 1 year ago
1,690 stars

Top 25.1% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

EvalScope is a comprehensive framework for evaluating and benchmarking diverse large models, including LLMs and multimodal models. It supports various assessment scenarios like RAG, arena mode, and inference performance testing, offering built-in benchmarks and metrics. The framework is designed for researchers and developers needing a streamlined, customizable solution for model evaluation, seamlessly integrating with training frameworks like ms-swift.

How It Works

EvalScope employs a modular architecture with Model Adapters for input standardization, Data Adapters for data processing, and multiple Evaluation Backends. It supports its native backend, OpenCompass, VLMEvalKit for multimodal tasks, and RAGEval for RAG scenarios, alongside third-party integrations like ToolBench. A dedicated Performance Evaluator module measures inference service performance, with results compiled into comprehensive reports and visualizations.
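
As a concrete illustration of the backend routing described above, here is a minimal, hedged sketch using the run_task/TaskConfig API referenced in the Quick Start section. The eval_backend/eval_config field names follow the documented TaskConfig pattern, but the backend-specific keys shown are assumptions and may differ from the current release:

```python
# Hedged sketch: routing a task to a non-native evaluation backend.
# `eval_backend` and `eval_config` follow the documented TaskConfig pattern,
# but the backend-specific keys below are illustrative assumptions; check the
# EvalScope docs for the exact schema expected by each backend.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    eval_backend="OpenCompass",        # or "Native", "VLMEvalKit", "RAGEval"
    eval_config={                      # forwarded to the selected backend
        "datasets": ["mmlu"],          # benchmark name as the backend expects it (assumed)
        "models": [                    # model entries in the backend's format (assumed)
            {"path": "Qwen/Qwen2.5-0.5B-Instruct"},
        ],
        "limit": 10,                   # small sample cap for a smoke test
    },
)

run_task(task_cfg=task_cfg)            # results are compiled into reports and visualizations
```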

Quick Start & Requirements

  • Installation: pip install evalscope (or pip install 'evalscope[all]' for all backends). To install from source, git clone the repository and run pip install -e .
  • Prerequisites: Python 3.10 is recommended. Optional dependency groups enable specific backends (opencompass, vlmeval, rag, perf, app).
  • Quick Start: Run evalscope eval --model <model_id> --datasets <dataset_names> --limit <num>, or use the Python API via run_task (see the sketch after this list).
  • Documentation: English Documents
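
A minimal quick-start sketch tying the CLI invocation and the run_task API together; the model and dataset names are placeholders, and the dict keys should be checked against the documentation:

```python
# Hedged quick-start sketch mirroring the CLI shown above:
#   evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 10
# Model and dataset names are placeholders; the dict-style task config follows
# the documented run_task usage.
from evalscope import run_task

task_cfg = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",  # model id (placeholder)
    "datasets": ["gsm8k"],                  # one or more built-in benchmarks
    "limit": 10,                            # evaluate only the first N samples
}

run_task(task_cfg=task_cfg)
```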

Highlighted Details

  • Supports a wide array of evaluation backends including native, OpenCompass, VLMEvalKit, and RAGEval.
  • Includes a dedicated module for model serving performance evaluation and stress testing (see the sketch after this list).
  • Offers visualization tools for evaluation results and supports integration with wandb and swanlab.
  • Features an "Arena Mode" for pairwise model comparison and evaluation.

Maintenance & Community

  • Actively maintained by ModelScope.
  • Community support via Discord.
  • Roadmap available, indicating ongoing development for distributed evaluation and new benchmarks.

Licensing & Compatibility

  • The specific license is not explicitly stated in the README, but the project's association with ModelScope suggests a permissive open-source license. Commercial use is likely permitted, but verify the exact license first.

Limitations & Caveats

  • The project was recently renamed from llmuses to evalscope; users of older versions need to update their imports.
  • The exact license is not clearly stated, which may be a consideration for commercial use.

Health Check

  • Last Commit: 22 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 29
  • Issues (30d): 59

Star History

182 stars in the last 30 days

Explore Similar Projects

RULER by NVIDIA
Evaluation suite for long-context language models (research paper)
1k stars · 0.8% · Created 1 year ago · Updated 1 month ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Didier Lopes (Founder of OpenBB), and 2 more.

evaluate by huggingface
ML model evaluation library for standardized performance reporting
2k stars · 0.1% · Created 3 years ago · Updated 1 month ago
Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

simple-evals by openai
Lightweight library for evaluating language models
4k stars · 0.3% · Created 1 year ago · Updated 1 month ago
Starred by Patrick von Platen (Author of Hugging Face Diffusers; Research Engineer at Mistral), Simon Willison (Coauthor of Django), and 16 more.